Google HTML Analysis

By Deane Barker on January 25, 2006

Google Code: Web Authoring Statistics: Google parsed a billion Web pages and pulled some stats out of the HTML.

“We can now add to this data. In December 2005 we did an analysis of a sample of slightly over a billion documents, extracting information about popular class names, elements, attributes, and related metadata. The results we found are available below. We hope this is of use!”

Some random notes:

  • The most common META tags specified:

    • keywords
    • description
    • robots
    • generator (thanks, FrontPage)
    • author

  • The BODY tag is a huge repository of non-CSS badness (bgcolor, margin, link, etc.)

    Very few people put an “id” on the BODY tag. I do this for pages that directly relate to an identifiable object in the system, so that I can make per-object CSS changes, if necessary (having id="object_232" on your BODY tag is handy like you wouldn’t believe).

  • Very few people use COLGROUP. People should use it more.

  • Most popular class names for elements:

    • footer
    • menu
    • title
    • small
    • text

  • Google notes that these class names map “very well to the elements being proposed in HTML5.”

  • They single out GoLive for crappy HTML:

    GoLive’s footprints are all over the Web. A scary number of pages use […], not to mention the multitude of […], […], and […] elements.

    We have made the same observation: “Adobe GoLive: Evil Incarnate.” Those people should be shot.

  • There were enough misspellings of the “language” attribute for the SCRIPT tag that four of them registered appreciably in the analysis.

Really interesting stuff. There are hundreds and hundreds of observations in here worth reading if you deal with HTML on a daily basis.
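
If you want to run the same kind of tally over your own pages, here is a minimal sketch using Python's standard html.parser module. It counts META names and class names, the two lists noted above. This is not how Google did it; the TagStats class, the tally function, and the sample pages are all made up for illustration.

```python
from collections import Counter
from html.parser import HTMLParser


class TagStats(HTMLParser):
    """Tally META names and class names seen in the pages fed to it."""

    def __init__(self):
        super().__init__()
        self.meta_names = Counter()
        self.class_names = Counter()

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names before calling us.
        attributes = dict(attrs)
        if tag == "meta" and attributes.get("name"):
            # Normalize case so "Keywords" and "keywords" count together.
            self.meta_names[attributes["name"].strip().lower()] += 1
        # A class attribute can hold several space-separated names.
        for name in (attributes.get("class") or "").split():
            self.class_names[name] += 1


def tally(pages):
    """Return (meta_names, class_names) Counters for an iterable of HTML strings."""
    stats = TagStats()
    for html in pages:
        stats.feed(html)
        stats.close()
        stats.reset()  # clear leftover parser state; the counters are kept
    return stats.meta_names, stats.class_names


# Hypothetical sample input, just to show the call.
sample_pages = [
    '<html><head><meta name="Keywords" content="a">'
    '<meta name="description" content="b"></head>'
    '<body class="text small"><div class="footer"></div></body></html>',
    '<meta name="keywords" content="c"><p class="footer">x</p>',
]
meta_names, class_names = tally(sample_pages)
print(meta_names.most_common())
print(class_names.most_common())
```

Counters make the "most popular" question a one-liner via most_common(). At a billion documents you would need something sturdier than a single in-memory parser, but for one site this is plenty.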
