A Large-Scale Study of the Evolution of Web Pages: This page may seem terribly long and dry, but it’s fascinating. Researchers from Microsoft and HP (as part of the PageTurner project) measured the rate of change and decay of Web pages over time.
“Between 26 Nov. 2002 and 5 Dec. 2002, we performed a large crawl (‘crawl 1’) that downloaded 151 million HTML pages as well as 62 million non-HTML pages, which we subsequently ignored. We then attempted to fetch each of these 151 million HTML pages ten more times over a span of ten weeks.”
They mined the results and explained their findings in this paper, which was presented at the 12th International World Wide Web Conference in Budapest in May.
Fewer than half of the pages (49.2%) were accessible on all 11 attempts. The share of fetches returning a 200 status code declined over the course of the 11 crawls, dropping to just under 90% by the end, with most of the change coming from documents that returned redirections or 404s. The worst TLD for reliability seems to be .cn (less than 80% of Web pages still around at crawl 11), and the best is .dk (over 95%).
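If you want a feel for how numbers like these fall out of repeated crawls, here is a minimal sketch in Python. It is not the researchers' code. The `summarize_crawls` function, the `history` layout (one list of status codes per URL, all the same length), and the toy URLs are assumptions made up for illustration.

```python
def summarize_crawls(history):
    """Summarize repeated fetches of the same set of pages.

    history maps a URL to a list of HTTP status codes, one per crawl
    (hypothetical layout; every list is assumed to have the same length).
    """
    if not history:
        return {"always_ok": 0.0, "ok_share_per_crawl": []}

    n_pages = len(history)
    n_crawls = len(next(iter(history.values())))

    # Pages that returned 200 on every single attempt.
    always_ok = sum(
        all(code == 200 for code in codes) for codes in history.values()
    )

    # Share of pages returning 200 in each individual crawl.
    ok_share = [
        sum(codes[i] == 200 for codes in history.values()) / n_pages
        for i in range(n_crawls)
    ]

    return {"always_ok": always_ok / n_pages, "ok_share_per_crawl": ok_share}


# Toy data: four made-up pages fetched three times each.
toy = {
    "http://example.com/a": [200, 200, 200],
    "http://example.com/b": [200, 404, 404],
    "http://example.com/c": [200, 200, 302],
    "http://example.com/d": [200, 200, 200],
}

print(summarize_crawls(toy))
```

On the toy data, half of the pages answer 200 every time, and the per-crawl share of 200s slides from 1.0 to 0.75 to 0.5. The study's figures have the same shape, just at a vastly larger scale.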
Do yourself a favor and scan this report — there’s all sorts of interesting data in it.