Re: Archive an extra copy of documents

From: Robert Collins <robertc@dont-contact.us>
Date: 25 Aug 2002 23:54:05 +1000

On Fri, 2002-08-23 at 14:11, Brad Tofel wrote:
> Hi all,

Hi!

> My thinking is to have all the requests to the live web from the CGI
> machines go through squid, and then to modify squid so that in addition to
> its standard cache, documents are also written to ARC files (our very
> simple archival format: a metadata line with the URL, the remote IP, a
> timestamp, mime-type and document-length, followed by the document
> itself [which includes HTTP headers]. Append more metadata-document chunks
> until your file is 100MB, and that's an ARC file.) As ARC files are
> generated, we'd pull them off the squid host and move them into the main
> archive.

This sounds -very- similar to Adrian's COSS squid filesystem, with the
difference being that rather than recycle the COSS store, you could
rotate to a new one. The neat thing about that is that you could
potentially recycle a lot of code.
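For what it's worth, here's a rough standalone sketch of what appending
one record could look like, going purely from your description of the
format (the field order, separators, and timestamp format are my
guesses, not a spec):

  #include <stdio.h>
  #include <time.h>

  #define ARC_ROTATE_SIZE (100L * 1024 * 1024)  /* start a new ARC at 100MB */

  /* Append one metadata line + document to an open ARC file.
   * Returns nonzero once the file is big enough to rotate. */
  static int
  arc_append(FILE *arc, const char *url, const char *ip,
             const char *mime, const char *doc, long doclen)
  {
      char ts[20];
      time_t now = time(NULL);
      strftime(ts, sizeof(ts), "%Y%m%d%H%M%S", gmtime(&now));
      /* metadata: URL, remote IP, timestamp, mime-type, document-length */
      fprintf(arc, "%s %s %s %s %ld\n", url, ip, ts, mime, doclen);
      /* document follows verbatim, HTTP headers included */
      fwrite(doc, 1, (size_t) doclen, arc);
      return ftell(arc) >= ARC_ROTATE_SIZE;
  }

Rotation is then just fclose() plus fopen() on a fresh filename, and the
closed file can be shipped off to your main archive.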
 
> We only want to append documents to our ARC file as they are actually
> downloaded from the live web (the cache missed or the cached version was
> too old).
>
> We'll only be using squid for HTTP traffic, so at first blush it seems that
> we want to put our code into http.c, where there is the smarts to know that
> an HTTP connection has just completed successfully, but of course this is a
> simplistic view of some complex code. Another good choice seems to be in
> store.c or store_io.c, but I'm confused by which functions are being used
> when a "cache-retrieval completes", versus a "download-to-cache" completes.
> I've looked through the online docs, but haven't found anything yet that's
> giving me much traction.

Yeah, this is a common point of confusion. Squid's current store doesn't
actually perform the IMS (If-Modified-Since) or revalidation logic -
client_side does that.
 
> Can someone with a good understanding of the code give me a head-start on a
> good approach to implementing this, or a pointer to the right documentation
> that I've missed? Is squid overkill for what we're trying to do?

Squid can do what you need with some careful hacking. IMO the best place
in the current code is the client_side chain of functions that begins
with clientProcessRequest. In there you will be able to determine
whether the object is one you are interested in archiving off to your
store, and capture all the data as it goes through squid. Be sure to
disable doing that for range replies, though! Lastly, be very sure to
use async I/O if you need performance to stay reasonable. Doubling the
amount of disk writes squid has to do will have a performance impact,
especially if you use blocking I/O. Squid has a framework for async disk
I/O, but without checking I can't tell you whether it's usable outside
the store routines.
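
To make the shape of that concrete, the hook could look something like
the sketch below. Every name here is invented for illustration - the
real job is finding the right call site in the clientProcessRequest
reply path, and arcQueueWrite stands in for whatever async write
interface turns out to be usable:

  #include <stddef.h>

  /* Sketch only: a simplified stand-in for the reply state squid
   * actually carries around in client_side. */
  typedef struct {
      int is_range;        /* Range request/reply? */
      int hit;             /* served from cache rather than fetched? */
      const char *url;
      const char *remote_ip;
      const char *mime;
      const char *data;    /* HTTP headers + body, as downloaded */
      size_t len;
  } arc_candidate_t;

  /* Invented name: queue the record for async disk I/O so the
   * client-side path never blocks on the archive write. */
  extern void arcQueueWrite(const char *url, const char *ip,
                            const char *mime, const char *data,
                            size_t len);

  static void
  arcMaybeArchive(const arc_candidate_t *c)
  {
      if (c->is_range)
          return;          /* partial objects would corrupt the ARC */
      if (c->hit)
          return;          /* only archive fresh downloads, not hits */
      arcQueueWrite(c->url, c->remote_ip, c->mime, c->data, c->len);
  }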
 
> If there's anyone interested in helping with this customization, that would
> be fantastic, too! My guess is that the right person could whip this out in
> a matter of hours.

Probably a day with testing, and potentially some refactoring of the
disk I/O code. Reviewing that code is somewhere on my personal TODO list
regardless, but I couldn't say when (sorry!). There are folk here (and
I'm one of them) who will code-for-sponsorship, if you have a need for
the code in the short term. I will also happily discuss it indefinitely
with you to assist your own code-cutting efforts.

Cheers,
Rob
