Re: [squid-users] Reverse Proxy and Googlebot

From: Simon Waters <simonw_at_zynet.net>
Date: Mon, 6 Oct 2008 12:15:50 +0100

On Monday 06 October 2008 11:55:41 Amos Jeffries wrote:
> Simon Waters wrote:
> > Seeing issues with Googlebots retrying on large PDF files.
> >
> > Apache logs a 200 for the HTTP 1.0 requests.
> >
> > Squid logs an HTTP 1.1 request that looks to have stopped early (3MB out
> > of 13MB).
> >
> > This pattern is repeated with slight variation in the amount of data
> > served to the Googlebots, and after about 14 attempts it gives up and
> > goes away.
> >
> > Anyone else seeing same?
>
> Not seeing this, but.... do you have correct Expires: and Cache-Control
> headers on those .pdf? and is GoogleBot not obeying them?

Yes, the files have ETag and Expires headers. I don't think this is
Squid-specific, since I saw similar behaviour from Googlebots before there was
a reverse proxy involved.

The responses do have a "Vary: Host" header. I know how it got there, but I'm
not 100% sure what effect, if any, it has on caching; I'm hoping everything is
ignoring it. It may be relevant in general, but it shouldn't be relevant to
this request (since it is all for the same host).
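
To make the "Vary: Host" point concrete, here is a minimal sketch of how a
cache might fold the headers named in Vary into its lookup key. It is a
simplified model of the general HTTP caching rule, not Squid's actual
implementation; the cache_key function is hypothetical, and it does not
handle "Vary: *".

```python
def cache_key(url, request_headers, vary):
    """Build a lookup key from the URL plus the request headers named in Vary.

    Hypothetical helper for illustration only; real caches store variants
    differently and also have to handle "Vary: *".
    """
    # Header names are case-insensitive, so normalise both sides.
    names = sorted(n.strip().lower() for n in vary.split(",") if n.strip())
    lowered = {k.lower(): v for k, v in request_headers.items()}
    return (url,) + tuple((n, lowered.get(n, "")) for n in names)


URL = "http://somewhere.pdf"

# Two requests with the same Host map to the same variant, whatever the
# other headers say.
same_a = cache_key(URL, {"Host": "somewhere", "User-Agent": "wget"}, "Host")
same_b = cache_key(URL, {"host": "somewhere", "User-Agent": "Googlebot"}, "Host")

# A different Host value would select a different variant.
other = cache_key(URL, {"Host": "elsewhere"}, "Host")

if __name__ == "__main__":
    print(same_a == same_b)  # same Host, same key
    print(same_a == other)   # different Host, different key
```

Under this model the header should make no difference as long as every
request carries the same Host value.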

http://groups.google.com/group/Google_Webmaster_Help-Indexing/browse_thread/thread/f8ecc41ac9e5bc11

I just thought that, because there is a Squid reverse proxy in front of the
server, I had more information on what was going wrong, and that others here
might have seen something similar.

It looks like the Googlebot is timing out and retrying. Quite why it is not
getting the cached copy is unclear at this point, but since I can't control
the Googlebot I can't reproduce the problem with more logging. It also doesn't
seem to back off at all when it fails, which I think is the real issue here.
Google showed some interest last time, but never got back to me.

Squid logged TCP_MISS:FIRST_UP_PARENT for all of these requests. Today, when I
checked the headers using wget, I saw TCP_REFRESH_HIT:FIRST_UP_PARENT and
TCP_HIT:NONE, so Squid usually seems to be doing something sensible with the
file; it is just the Googlebot requests it dislikes.

Would you expect Squid to cache the first 3MB if the HTTP 1.1 request stopped
early?

66.249.67.185 - - [01/Oct/2008:08:37:13 -0700] "GET http://somewhere.pdf HTTP/1.1" 200 3596968 TCP_MISS:FIRST_UP_PARENT
66.249.67.185 - - [01/Oct/2008:08:47:34 -0700] "GET http://somewhere.pdf HTTP/1.1" 200 3342120 TCP_MISS:FIRST_UP_PARENT
66.249.67.185 - - [01/Oct/2008:08:53:47 -0700] "GET http://somewhere.pdf HTTP/1.1" 200 4106664 TCP_MISS:FIRST_UP_PARENT
66.249.67.185 - - [01/Oct/2008:08:59:51 -0700] "GET http://somewhere.pdf HTTP/1.1" 200 3973448 TCP_MISS:FIRST_UP_PARENT
66.249.67.185 - - [01/Oct/2008:09:06:12 -0700] "GET http://somewhere.pdf HTTP/1.1" 200 3762040 TCP_MISS:FIRST_UP_PARENT
66.249.67.185 - - [01/Oct/2008:09:12:35 -0700] "GET http://somewhere.pdf HTTP/1.1" 200 3843128 TCP_MISS:FIRST_UP_PARENT
66.249.67.185 - - [01/Oct/2008:09:18:46 -0700] "GET http://somewhere.pdf HTTP/1.1" 200 3206008 TCP_MISS:FIRST_UP_PARENT
66.249.67.185 - - [01/Oct/2008:09:25:00 -0700] "GET http://somewhere.pdf HTTP/1.1" 200 2958400 TCP_MISS:FIRST_UP_PARENT
66.249.67.185 - - [01/Oct/2008:09:31:25 -0700] "GET http://somewhere.pdf HTTP/1.1" 200 3659232 TCP_MISS:FIRST_UP_PARENT
66.249.67.185 - - [01/Oct/2008:09:37:59 -0700] "GET http://somewhere.pdf HTTP/1.1" 200 3643304 TCP_MISS:FIRST_UP_PARENT
66.249.67.185 - - [01/Oct/2008:09:44:35 -0700] "GET http://somewhere.pdf HTTP/1.1" 200 3950280 TCP_MISS:FIRST_UP_PARENT
66.249.67.185 - - [01/Oct/2008:09:50:44 -0700] "GET http://somewhere.pdf HTTP/1.1" 200 2182272 TCP_MISS:FIRST_UP_PARENT
66.249.67.185 - - [01/Oct/2008:09:57:16 -0700] "GET http://somewhere.pdf HTTP/1.1" 200 4154448 TCP_MISS:FIRST_UP_PARENT
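
For anyone wanting to look at the retry pattern in their own logs, a rough
sketch of how the entries above could be parsed to show the gap between
attempts and how much of the file was sent each time. It assumes each entry
has been joined onto a single line; the names (Entry, parse_line, retry_gaps)
are made up for the example, and the 13MB total is only the approximate size
mentioned earlier, not an exact figure.

```python
import re
from datetime import datetime
from typing import NamedTuple


class Entry(NamedTuple):
    client: str
    when: datetime
    status: int
    size: int
    result: str


# Matches the log format shown above, with each entry on one line.
LINE_RE = re.compile(
    r'^(?P<client>\S+) \S+ \S+ \[(?P<when>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\d+) (?P<result>\S+)$'
)


def parse_line(line):
    """Parse one access-log line into an Entry, or raise ValueError."""
    m = LINE_RE.match(line.strip())
    if m is None:
        raise ValueError("unrecognised log line: %r" % line)
    when = datetime.strptime(m.group("when"), "%d/%b/%Y:%H:%M:%S %z")
    return Entry(m.group("client"), when, int(m.group("status")),
                 int(m.group("size")), m.group("result"))


def retry_gaps(entries):
    """Seconds between each request and the one before it."""
    return [int((b.when - a.when).total_seconds())
            for a, b in zip(entries, entries[1:])]


# First four entries from the log above.
SAMPLE = [
    '66.249.67.185 - - [01/Oct/2008:08:37:13 -0700] "GET http://somewhere.pdf HTTP/1.1" 200 3596968 TCP_MISS:FIRST_UP_PARENT',
    '66.249.67.185 - - [01/Oct/2008:08:47:34 -0700] "GET http://somewhere.pdf HTTP/1.1" 200 3342120 TCP_MISS:FIRST_UP_PARENT',
    '66.249.67.185 - - [01/Oct/2008:08:53:47 -0700] "GET http://somewhere.pdf HTTP/1.1" 200 4106664 TCP_MISS:FIRST_UP_PARENT',
    '66.249.67.185 - - [01/Oct/2008:08:59:51 -0700] "GET http://somewhere.pdf HTTP/1.1" 200 3973448 TCP_MISS:FIRST_UP_PARENT',
]

ENTRIES = [parse_line(line) for line in SAMPLE]
GAPS = retry_gaps(ENTRIES)

# Assumption: "13MB" taken as roughly 13,000,000 bytes.
ASSUMED_FULL_SIZE = 13 * 1000 * 1000

if __name__ == "__main__":
    for entry, gap in zip(ENTRIES, [None] + GAPS):
        pct = 100.0 * entry.size / ASSUMED_FULL_SIZE
        print(entry.when.time(), entry.size, "%.0f%%" % pct, gap)
```

On these four entries the gaps stay at roughly six to ten minutes rather than
growing, which matches the lack of back-off described above.
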
Received on Mon Oct 06 2008 - 11:15:55 MDT
