Re: thoughts on memory usage...

From: David Luyer <luyer@dont-contact.us>
Date: Wed, 20 Aug 1997 20:08:31 +0800 (WST)

[note: I've been CC:ing michael@ii.net on these discussions since he first
mentioned the idea of compressing URL strings, and I'm not sure if he's on
squid-dev or not]

[note on note: squid-dev isn't expn-able and squid-dev-request isn't
automated... is there an easy way to work out who's on squid-dev?
expn-ing squid-dev-outgoing/squid-dev-dist doesn't work either....]

On Wed, 20 Aug 1997, Andres Kroonmaa wrote:
> If hash.c is good enough, we'd need no entry->url altogether, successful
> hash lookup means we have an url hit. Request URL would be kept in ram
> only as long as request is being serviced.

1) No hash is perfect; if it were, it would be a compression technique
(even if an extremely hard one to reverse :). Even if the md5sums of two
URL strings are the same, we'd still have to look up the URL in the
on-disk file to be sure.

2) In this case every ICP query would require an md5 encoding, an open()
of an on-disk file to verify the true URL, etc. Giving an ICP response
for the wrong URL and then realising it later is really unacceptable in a
peering hierarchy, especially if you don't allow miss_access and/or you
have a slow upstream link.
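
To make the cost in points 1 and 2 concrete, here is a rough sketch (in
Python for brevity, not Squid code). The class DigestIndex, its fields and
the dict standing in for the on-disk file are all hypothetical names made up
for illustration. It only shows that a digest match alone is not proof of a
hit, so every apparent hit costs a read of the stored URL.

```python
import hashlib


class DigestIndex:
    """Hypothetical index that keeps only MD5 digests in memory."""

    def __init__(self):
        self.by_digest = {}   # 16-byte digest -> file number
        self.disk = {}        # file number -> URL; stand-in for the on-disk file
        self.disk_reads = 0   # counts the simulated open()/read calls

    def add(self, url):
        digest = hashlib.md5(url.encode("utf-8")).digest()
        fileno = len(self.disk)
        self.disk[fileno] = url
        # A colliding URL would overwrite the earlier entry here; a real
        # store would need a policy for that case.
        self.by_digest[digest] = fileno
        return fileno

    def lookup(self, url):
        digest = hashlib.md5(url.encode("utf-8")).digest()
        fileno = self.by_digest.get(digest)
        if fileno is None:
            return False          # definite miss, no disk access needed
        self.disk_reads += 1      # digest matched: verify against the real URL
        return self.disk[fileno] == url
```

A miss is cheap, but each candidate hit (for example, each ICP query that
matches a digest) pays for one verification read.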

Using a primitive compression (like the one I've described or better, or
using slf's scheme where you're essentially "compressing" a URL into a
4-byte index which is really a key into a tree structure) gives you a
guaranteed, reversible, one-to-one mapping for any URL. This means
that you don't have to store the real URL anywhere (except in the on-disk
log file if you plan on changing the compression technique between
versions; of course this aspect could also be achieved with
backwards-compatibility logfile-reading routines or a conversion script,
with the encoded URLs stored on disk).
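
As a rough illustration of the "index into a tree structure" idea, here is a
minimal sketch (Python, not Squid code, and not slf's actual scheme or my
encoder). The class UrlTree and the way it splits URLs at "/" are
assumptions made for the example. Each node stores its parent's index and
one URL component, so the leaf index alone identifies the URL and can be
walked back to the full string.

```python
import re


class UrlTree:
    """Hypothetical prefix tree; a URL is represented by its leaf index."""

    def __init__(self):
        self.nodes = [(0, "")]   # index 0 is the root (empty URL)
        self.children = {}       # (parent index, component) -> node index

    @staticmethod
    def _split(url):
        # Keep each "/" attached to the component before it, so that
        # joining the components reproduces the URL exactly.
        return re.findall(r"[^/]*/|[^/]+", url)

    def encode(self, url):
        index = 0
        for part in self._split(url):
            key = (index, part)
            if key not in self.children:
                self.nodes.append(key)
                self.children[key] = len(self.nodes) - 1
            index = self.children[key]
        return index

    def decode(self, index):
        parts = []
        while index != 0:
            # Each node is (parent index, component); step up to the parent.
            index, part = self.nodes[index]
            parts.append(part)
        return "".join(reversed(parts))
```

The same URL always yields the same index, different URLs yield different
indices, and the index fits in 4 bytes as long as the tree has fewer than
2^32 nodes. Shared prefixes such as the host name are stored only once.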

Now that I've ironed all the bugs out of my encoding method, it encodes
65.5M of URLs (with trailing NUL) into 44.3M of space (32.4% saving). This is with
a

David.

Received on Tue Jul 29 2003 - 13:15:42 MDT
