Re: html prefetching

From: Andres Kroonmaa <andre@dont-contact.us>
Date: Tue, 6 Jun 2000 00:29:03 +0200

On 3 Jun 2000, at 11:16, Alex Rousskov <rousskov@ircache.net> wrote:

> On Sat, 3 Jun 2000, Daniel O'Callaghan wrote:
>
> > > It would not be a trivial hack. Deciding which
> > > objects to prefetch is somewhat complicated I think.
> >
> > Surely a large amount of benefit could be had just by limiting the
> > prefetch to <IMG> tags.
>
> To me, there is only one sure thing about prefetching: not all
> "embedded" objects are requested by the browser when rendering a page.

 All your arguments are serious, and I agree with you for the most part.

> Lots of factors, including browser cache and javascript funnies, make
> the assumption that every <img> tag should be prefetched false.

 Definitely, this should be investigated.
 I think no one really means to prefetch _all_ referenced objects.
 The prefetch algorithm should be tuned to be optimal, whatever that
 means. For example, I think it is pretty safe to prefetch .gif and
 .jpg images, as they are very static, and even if not consumed
 immediately they can be useful on further accesses to the same site.
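
 To illustrate, a quick sketch (Python; the class name and extension
 list are mine, not anything existing) of how such a conservative
 candidate extractor might pick only static-looking <IMG> references
 out of a page:

```python
# Sketch: collect the "safe" prefetch candidates discussed above --
# static .gif/.jpg <IMG> references in a page.  Illustrative only;
# a real prefetcher would also have to resolve relative URLs,
# honour cache-control headers, etc.
from html.parser import HTMLParser

class ImgCandidateParser(HTMLParser):
    """Collects src attributes of <img> tags that look like static images."""
    def __init__(self):
        super().__init__()
        self.candidates = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        for name, value in attrs:
            if name == "src" and value:
                # Limit to the extensions considered safe to prefetch.
                if value.lower().rsplit("?", 1)[0].endswith((".gif", ".jpg", ".jpeg")):
                    self.candidates.append(value)

page = '<html><img src="logo.gif"><img src="chart.png"><img SRC="photo.JPG"></html>'
parser = ImgCandidateParser()
parser.feed(page)
print(parser.candidates)  # ['logo.gif', 'photo.JPG'] -- the .png is skipped
```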

> We can spend 10 more e-mails arguing about the percentage of images that
> are worth prefetching.
 ;) Seems like I've already started. Perhaps this arguing would give a
 better understanding of whether to prefetch, and if so, what to prefetch?

> Those who believe that the answer is a sure YES,
> may want to ask themselves a simple question: how come nobody but
> CacheFlow has implemented and promoted that feature?
 Well, all your other arguments are very serious, but not this one.
 If we always asked ourselves such questions, I'd still be climbing
 trees, I guess ;P Someone is always the first; does this by
 default mean that they must be wrong? Or was that a trick question?

> Clearly, the
> prefetching algorithm is straightforward and could have been implemented
> by virtually any cache vendor if it was indeed a speed-for-nothing
> solution. The only constructive way to end the discussion would be to
> _demonstrate_ that prefetching works (or does not work).

 Sometimes I'm really amused by how often such discussions are wanted
 to be _ended_ before even being started. What's so wrong with having
 such discussions before jumping into "constructive demonstration" mode?

> The simplest yet reliable way to demonstrate the usefulness of
> prefetching may be to write a small program that counts the number of
> successful prefetches using standard Squid access logs and sampling
> technique to retrieve HTML pages. Squid logs + html pages contain enough
> information to see how many of the embedded objects were actually
> requested by the client shortly after the container page was served.

 OK, I tried to do that on real logs. Simple, but not very reliable.
 For demonstration purposes we want to reach some average measure of
 the usefulness of prefetching, and based on that average decide
 whether it is worth doing or not. Unfortunately, there are so many
 factors that shift the averages and influence the purity of the
 tests that it is quite difficult to pick out what is meaningful.
 For example, a user didn't wait until the full page fetch completed,
 because it took more than 30 seconds. Can you blame the user? So the
 page fetch was stopped, and you don't have all the component objects
 in the logs.
 Well, in a way we can say that we have conserved lots of
 bandwidth ;), can't we? In this sense, it would be very bad to make
 Squid prefetch. But this is not a pure indication of whether
 prefetching works. Perhaps if the user had gotten his page fully in
 time, he would have stayed longer on that page and site?
 As to analysis, it is very difficult to account for such aborts,
 as it is not known whether the user stopped fetching out of
 impatience or because his browser never intended to fetch those
 components.

 http://www.online.ee/~andre/squid/prefetch/
 There is a quick and dirty attempt to analyse the usefulness of
 prefetch. You'll see that it is dirty: it fails to cleanly detect
 pure image object references, and at times extracts javascript
 references, which I believe should be handled as a special case.
 There are many errors that reduce the usefulness ratio, and also
 several deficiencies that miss actual fetches by the browser.
 But this is a preliminary test. I plan to run a more isolated test
 with a special list of URLs and a dedicated browser client.
 Perhaps you have some list of URLs I should specifically add?

 Please comment on the test.
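
 For reference, the core of such a log-based count can be sketched
 like this (Python; the 30-second window and the image-extension test
 are arbitrary choices of mine, and a real script would have to handle
 the full access.log format, HTML detection by content type, per-page
 referrers, and so on):

```python
# Minimal sketch of the log analysis described above: given Squid
# access.log-style entries, count how many image objects each client
# requested within a short window after fetching an HTML page.
# Field positions follow the native access.log layout
# (timestamp, elapsed, client, code/status, bytes, method, URL, ...).
from collections import defaultdict

WINDOW = 30.0  # seconds after the container page; an arbitrary choice

def parse_line(line):
    f = line.split()
    return float(f[0]), f[2], f[6]   # timestamp, client IP, URL

def followup_counts(lines):
    """Map each HTML fetch to the number of image fetches that follow it."""
    html_fetches = []            # (time, client) of container pages
    images = defaultdict(list)   # client -> times of image fetches
    for line in lines:
        ts, client, url = parse_line(line)
        if url.lower().endswith((".html", "/")):
            html_fetches.append((ts, client))
        elif url.lower().endswith((".gif", ".jpg", ".jpeg")):
            images[client].append(ts)
    return [sum(1 for t in images[c] if 0 <= t - ts <= WINDOW)
            for ts, c in html_fetches]

log = [
    "100.0 120 10.0.0.1 TCP_MISS/200 4000 GET http://site/index.html - DIRECT -",
    "100.5 80 10.0.0.1 TCP_MISS/200 900 GET http://site/a.gif - DIRECT -",
    "101.0 90 10.0.0.1 TCP_MISS/200 700 GET http://site/b.jpg - DIRECT -",
    "500.0 70 10.0.0.1 TCP_MISS/200 800 GET http://site/c.gif - DIRECT -",
]
print(followup_counts(log))  # [2] -- two images fetched within the window
```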

> Another important factor to measure would be the savings in response
> time we would get by starting prefetching the objects earlier (using
> recorded response times for embedded objects).

 This is quite difficult to do. To get real results, we'd actually
 need to implement prefetching. Otherwise we'd have to make lots of
 assumptions that may not hold true. For example, we could assume that
 all browser-requested objects were started at the same time. Then
 we'd need to assume that the client consumes all objects at high
 speed, or else account for the client's speed limit when estimating
 total page fetch time.
 It most probably does not make sense to prefetch if the client takes
 longer to download the components than it takes to fetch them in
 sequence, like a slow dialup client fetching a page from a fast
 nearby server. But for a fast client fetching a page from a remote
 server with high RTT, this may really help a lot.
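
 As a back-of-the-envelope illustration (all numbers invented), here
 is the kind of model I mean: sequential fetches each pay the server
 RTT, while prefetched objects overlap their RTTs and are drained at
 the client's link speed:

```python
# Toy model of the trade-off described above: page completion time
# when embedded objects are fetched one after another (each paying the
# full server RTT) versus prefetched in parallel and limited only by
# the client's link speed.  Sizes in bytes, RTT in seconds, bandwidth
# in bytes/sec; every value below is invented for illustration.

def sequential_time(n_objects, size, rtt, client_bw):
    # Each object pays the RTT, then transfers at the client's speed.
    return n_objects * (rtt + size / client_bw)

def prefetch_time(n_objects, size, rtt, client_bw):
    # The proxy overlaps the RTTs; after one RTT to get things moving,
    # the client link is the bottleneck.
    return rtt + n_objects * size / client_bw

# Fast client, distant server: prefetching should win big.
print(sequential_time(10, 10_000, 0.3, 1_000_000))  # roughly 3.1 s
print(prefetch_time(10, 10_000, 0.3, 1_000_000))    # roughly 0.4 s

# Slow dialup client, nearby server: almost nothing to gain.
print(sequential_time(10, 10_000, 0.02, 4_000))     # roughly 25.2 s
print(prefetch_time(10, 10_000, 0.02, 4_000))       # roughly 25.0 s
```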

> Given such a tool, people can run it against their logs and get a
> reasonable estimate for their environment.
>
> Clearly, anybody is free to implement prefetching without validating its
> usefulness first. If the implementation ever makes it into the official Squid
> code, it should not be enabled by default, of course.
 of course.

> There are also many big non-performance question here. Let's assume that
> most content providers do not mind proxies prefetching images (we
> already know that this is not 100% true). Let's also assume that a proxy
 I've seen several mentions of some kind of conflict, but no story.
 Would you like to describe it in a few words?

> Now, imagine a "custom" content provider that generates HTML-looking
> pages for some custom clients (which, for example, retrieve just one of
> the 1000 images embedded in the page). What would you do about it?

 Hmm, quite extreme. I imagine that Squid would not look like this
 custom client to the remote server. And if they generate a web page
 that may instruct a non-custom web browser to retrieve all the
 images, then they are nuts.
 What can we do about it? Don't prefetch more than, say, 10 images at
 a time, and prefetch more only if the browser actually consumes those
 already prefetched. We could also enable prefetching only for known
 client browsers.

> Also, imagine that due to the client functionality and page layout,
> requesting an image actually means something like acknowledgment to buy
> a product. Will you reimburse the customers for purchases they did not
> really make?

 Well, well, even more extreme ;) I think the customer should sue them, and win ;)

> To summarize the non-performance section: For good or bad, content
> providers like HTTP for its simplicity and general support. By design,
> these providers assume they are talking to end-clients. Proxies must be
> as "transparent" as possible in relation to the semantics of the
> exchange (which is ultimately defined by the content provider and the
> user). Otherwise, the providers may have to switch to different,
> proprietary protocols which you will not be able to proxy, at your and
> your users' expense...

 Strictly speaking, yes. But there are only about 2-3 different
 browsers around that make up over 90% of requests. Their behaviour is
 known and should be easy to follow.
 I'd imagine the prefetcher should build a list of potentially
 prefetchable objects, but not start actually prefetching them until
 the browser requests at least a few of them. Then we can decide that
 we have a potentially correct estimate of what the browser will
 request, and start prefetching.
 We may limit the number of prefetched objects to some maximum, and if
 we detect the browser aborting the download we can also stop
 prefetching immediately. We can also drop prefetching for the page if
 we find prediction correctness to be below some threshold.
 Depending on the intelligence of the prefetcher, we can decide either
 not to include javascript and other funnies, or, if and when we are
 confident, to include those as well.

 But I think that to stay on safe ground, we should limit prefetching
 to only the static GIFs and JPEGs that make up the page face. This
 alone could save quite a bit of loading time for the usual surfer.
 And as it is not so hard to behave like the known browsers, I think
 we can stay pretty much transparent.
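
 Roughly, the policy I have in mind could look like this (Python
 sketch; the class and all the thresholds are made-up illustrative
 values of mine, not tested ones):

```python
# Sketch of the cautious policy outlined above: collect candidates,
# hold off until the browser has confirmed a few of them, cap the
# number of prefetches, and bail out on aborts or poor prediction
# accuracy.  All thresholds are invented for illustration.

class PrefetchPolicy:
    CONFIRMATIONS_NEEDED = 3   # browser must request this many candidates first
    MAX_PREFETCH = 10          # never prefetch more than this per page
    MIN_ACCURACY = 0.5         # stop if fewer than half our guesses are used

    def __init__(self, candidates):
        self.candidates = set(candidates)
        self.confirmed = 0       # candidate URLs the browser actually asked for
        self.requested = 0       # all URLs the browser asked for
        self.prefetched = 0      # objects we have prefetched so far
        self.aborted = False

    def on_browser_request(self, url):
        self.requested += 1
        if url in self.candidates:
            self.confirmed += 1

    def on_abort(self):
        # Browser gave up on the page: stop immediately.
        self.aborted = True

    def should_prefetch(self):
        if self.aborted or self.prefetched >= self.MAX_PREFETCH:
            return False
        if self.confirmed < self.CONFIRMATIONS_NEEDED:
            return False          # prediction not yet validated
        if self.requested and self.confirmed / self.requested < self.MIN_ACCURACY:
            return False          # our candidate list predicts poorly
        return True

policy = PrefetchPolicy(["a.gif", "b.gif", "c.jpg", "d.jpg"])
print(policy.should_prefetch())   # False: no confirmations yet
for url in ["a.gif", "b.gif", "c.jpg"]:
    policy.on_browser_request(url)
print(policy.should_prefetch())   # True: 3 of 3 requests hit our candidate list
policy.on_abort()
print(policy.should_prefetch())   # False: browser gave up on the page
```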

------------------------------------
 Andres Kroonmaa <andre@online.ee>
 Network Development Manager
 Delfi Online
 Tel: 6501 731, Fax: 6501 708
 Pärnu mnt. 158, Tallinn,
 11317 Estonia
Received on Mon Jun 05 2000 - 16:30:34 MDT
