Re: [RFC] byte hit ratio

From: Amos Jeffries <squid3_at_treenet.co.nz>
Date: Wed, 08 Feb 2012 01:00:36 +1300

On 7/02/2012 9:40 p.m., Henrik Nordström wrote:
> On Tue, 2012-02-07 at 14:01 +1300, Amos Jeffries wrote:
>> We have a long history of questions and bugs mentioning negative
>> numbers in the byte hit ratio.
>>
>> I've always thought it was a bug we had not tracked down, but the FAQ
>> says it is correct.
>> http://wiki.squid-cache.org/SquidFaq/InnerWorkings#Why_do_I_see_negative_byte_hit_ratio.3F
> Yes.. it's based on the difference between traffic squid<-servers and
> clients<-squid. This can be negative (more traffic squid<-servers than
> clients<-squid) in some situations.
>
> - retried requests
> - range retrieval being processed by Squid
> - continued download after client disconnects (quick_abort_...)

Wiki also mentions cache digests but ...
" /*
      * This ugly hack is here to prevent the user from seeing a
      * negative byte hit ratio. When we fetch a cache digest from
      * a neighbor, it gets treated like a cache miss because the
      * object is consumed internally. Thus, we subtract cache
      * digest bytes out before calculating the byte hit ratio.
      */
     cd = CountHist[0].cd.kbytes_recv.kb - CountHist[minutes].cd.kbytes_recv.kb;
"
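
As a rough illustration of what that comment describes, the sketch below removes digest bytes from the server-side total before forming the ratio. The function and parameter names are hypothetical and are not Squid identifiers; only the idea of subtracting cache digest bytes comes from the comment quoted above.

```python
def byte_ratio_without_digests(client_kb, server_kb, digest_kb):
    """Byte ratio with cache digest fetches removed from server-side traffic.

    All names here are illustrative assumptions, not Squid's own code.
    """
    if client_kb == 0:
        return 0.0
    # Digest fetches are consumed internally, so they are not client traffic.
    adjusted_server_kb = server_kb - digest_kb
    return (client_kb - adjusted_server_kb) / client_kb
```

With 1000 KB sent to clients, 1100 KB received from servers and 200 KB of that being digests, the uncorrected ratio would be negative and the corrected one positive. The correction only hides the digest case; the other causes listed earlier still push the number below zero.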

>> I've discussed this with a professional statistician I work with and
>> she agrees the algorithm is not calculating hit ratio as per our
>> definition of what a HIT is. What it does seem to be calculating is a
>> net traffic GAIN ratio.
> Yes.
>
>> What I propose is make the numbers reported as HIT ratios use the same
>> algorithm. The current request ratio one. And to add alongside this a
>> record for Gain/Loss Ratio as output by this byte calculation.
> Why is it interesting to calculate a nicer but very inaccurate number?

Which one is inaccurate?
   "Hits as % of traffic sent" with calculation of (net traffic / client_bytes)
or
   "Net traffic gain/loss" with calculation of (net traffic / client_bytes)
or
   "Hits as % of client traffic" with calculation of (sum_hits / client_bytes)

One guess which one we have today ...

> To hide that the proxy cache may actually cause higher bandwidth usage
> than not having the proxy cache?

This is where the mistake rears its head. The excess server-side traffic
is not related to HITs, but to normal proxy behaviour. The HIT % of
client traffic may in fact be reducing that negative from some other
larger negative.
This is why I am more in favour of adding gain ratio alongside the hit
ratios or just changing the descriptive text. The negative is not lost
but explained.

Making HIT % use the same calculation as request ratio would mean adding
HIT traffic byte counters which don't exist now.
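
To make that concrete, here is a minimal sketch of what such counters could look like, assuming each reply is classified as HIT or MISS when it is sent to the client. The class and field names (HitCounters, hit_bytes and so on) are hypothetical, not existing Squid counters.

```python
from dataclasses import dataclass


@dataclass
class HitCounters:
    """Hypothetical per-reply counters; none of these names exist in Squid."""
    requests: int = 0
    hit_requests: int = 0
    client_bytes: int = 0
    hit_bytes: int = 0  # the counter that would have to be added

    def record(self, size, is_hit):
        # Every reply counts toward the totals; HITs also count toward hits.
        self.requests += 1
        self.client_bytes += size
        if is_hit:
            self.hit_requests += 1
            self.hit_bytes += size

    def request_hit_ratio(self):
        return self.hit_requests / self.requests if self.requests else 0.0

    def byte_hit_ratio(self):
        # Same shape as the request ratio: hits over total, so never negative.
        return self.hit_bytes / self.client_bytes if self.client_bytes else 0.0
```

Recording one 100-byte HIT and one 300-byte MISS gives a request hit ratio of 0.5 and a byte hit ratio of 0.25. Neither value depends on server-side traffic.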

>
> I would argue that the request hit ratio calculation is the broken one
> from a statistical point of view.

The byte ratio calculation is simply that: a byte ratio, with no
relevance to HIT or MISS.

Traffic we classify as MISS is included in the divisor for the existing
byte algorithm.
If it were actually (client_traffic - server_traffic) / hit_bytes or
hit_bytes / (client_traffic - server_traffic) that would be an accurate
HIT bytes algorithm.

Instead we currently have (client_traffic - server_traffic) /
client_traffic which is the gain score for net traffic.
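
A small worked sketch of that gain formula, using made-up numbers and a hypothetical function name, shows how it goes negative whenever server-side traffic exceeds client-side traffic:

```python
def net_gain_ratio(client_traffic, server_traffic):
    """(client_traffic - server_traffic) / client_traffic, as described above.

    The function name is an assumption for illustration only.
    """
    if client_traffic == 0:
        return 0.0
    return (client_traffic - server_traffic) / client_traffic


# Illustrative byte counts, not measurements.
saving = net_gain_ratio(1000, 400)   # servers sent less than clients received
loss = net_gain_ratio(1000, 1200)    # e.g. downloads continued after aborts
```

Here saving is positive and loss is negative, although the same number of bytes could have been served as HITs in both cases.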

We get asked about "bandwidth gain" often, so I think it would be useful
to have something in the report using the term "gain".

Amos
Received on Tue Feb 07 2012 - 12:00:41 MST
