On Tue, 26 Apr 2011 20:44:33 +0100, Sheridan "Dan" Small wrote:
> Thanks for your reply Amos.
>
> The tests are a suite of largely accessibility tests, with some
> usability tests, for web pages and other documents. Some are based
> on open source software, some are based on published algorithms,
> and others (the problematic ones) are compiled executables. Most of
> the tests were originally designed to test a single web page. I am,
> however, attempting to test entire large websites, e.g. government
> websites or the websites of large organisations. Data is to be
> collated from all tests on all web pages and other resources
> tested, and used to generate a report about the whole website, not
> just individual pages.
>
> The tests are largely automatic, with some manual configuration of
> cookie and form data etc. They run on a virtual server, which is
> terminated after a single job; only the report itself is kept. No
> runtime data, including any cache, is retained after that one job.
>
> A website, e.g. that of a news organisation, can change within the
> time it takes to run the suite of tests. I want one static snapshot
> of each web page, one per URL, to use as a reference, so that
> different tests do not report on different content for the same
> URL. I keep a copy of the web pages for reference within the
> report. (It would not be appropriate to keep multiple pages with
> the same URL in the report.) Some of the tests fetch documents
> linked to from the page being tested, so it is not possible to say
> which test will fetch a given file first.
>
> Originally I thought of downloading the files once, writing them to
> disk and processing them from the local copies. I even thought of
> using HTTrack ( http://www.httrack.com/ ) to create a static copy
> of the websites. The problem with both approaches is that I lose
> the HTTP header information. The headers are important because I
> would like to keep the test suite generic enough to handle
> different character encodings and content languages, and to make
> sense of response codes. Also, some tests complain if the header
> information is missing or incorrect.
>
> So what I really want is a static snapshot of a dynamic website,
> with correct HTTP header information. I know this is not what Squid
> was designed for, but I was hoping it would be possible. The idea
> was to use Squid to cache a static snapshot of the (dynamic)
> websites so that all the tests would run on the same content.
>
> Of secondary importance is that the test suite is cloud-based, and
> the cloud service provider charges for bandwidth. If I can avoid
> repeat requests for the same file, I can keep my costs down.
Hmm, okay. Squid, or any proxy, is not quite the right tool to use for
this. The software is geared around delivering the latest, freshly
revalidated version of everything on request.
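To give you an idea of what you would be fighting: to make Squid hold
and keep serving the first copy it sees, you would have to override
most of the normal freshness and cache-control handling. Roughly (an
untested sketch for Squid 3.1; the pattern and times are only
placeholders, and some of these options have been dropped in later
releases):

  # cache everything for a week and ignore the origin's instructions
  refresh_pattern . 10080 100% 10080 override-expire override-lastmod ignore-reload ignore-no-cache ignore-no-store ignore-private
  # allow large objects such as PDFs into the cache
  maximum_object_size 100 MB

Even then, dynamic pages with cookies, Vary headers or authentication
can slip past it, which is why I would not rely on it for a true
snapshot.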
I think the spider idea was the right track to go down...
We provide a tool called squidclient which is kept relatively simple
and optimized for integration with test suites like this. Its output
is a raw dump of the full HTTP reply headers and body, and the
request headers and source IP it sends from are configurable as well.
It can connect directly to any HTTP service (applet, server or
proxy), but it does need a proxy to gateway FTP and HTTPS services.
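For example, to fetch one page through a local proxy and capture the
reply exactly as received (the host, port and URL here are only
placeholders for your own setup):

  squidclient -h 127.0.0.1 -p 3128 http://www.example.com/ > example.dump

The dump starts with the full reply headers, followed by a blank line
and then the body, so your tests can read both from the one file.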
An alternative I'm nearly as fond of is wget. It can save the HTTP
headers alongside the downloaded object and has many configuration
options for spidering a whole site.
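Something along these lines (untested, and the proxy address, depth
and URL are only placeholders) would mirror a site through the proxy
while keeping the reply headers; --save-headers prepends them to each
saved file:

  http_proxy=http://127.0.0.1:3128/ \
  wget --recursive --level=2 --no-parent --save-headers \
       --directory-prefix=snapshot http://www.example.com/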
curl is another popular one that may be worth looking at, though I'm
not very familiar with it myself.
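If you do try it, something like this (again only a sketch, with
placeholder names) stores the headers and the body in separate files,
which sounds closest to what your report needs:

  curl --silent --proxy http://127.0.0.1:3128/ \
       --dump-header example.headers --output example.html \
       http://www.example.com/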
Amos