Measuring the Web Using a Versatile Meta Information Crawler
Abstract:
In this paper, we present data that characterize three aspects of Web
interactions: failures, timing performance, and protocol compliance. We
collected the data using our Versatile Meta Information Crawler, which is
designed to acquire a wide sample of the Web, accurately record its
behavior and performance, and build a large repository of Web page meta
information. We have crawled 300,000 Web pages under 130,000 domain names
and 90,000 IP addresses that are dispersed throughout the Web. The major
findings are as follows. For failures, the likelihood of encountering a
Web failure is 12%. DNS failures account for 50% of all communication
failures, and "URL Not Found" errors account for 90% of all transaction
failures. For timing performance, no single communication phase dominates
the Web transaction as a whole. We examine each phase in more detail
to identify its empirical parameters. For protocol compliance, persistent
connections are not indicated properly by major servers, and conditional
GET is not sufficiently supported. Based on the data, we suggest a number
of system improvements.