If you want to be friendly to the web, do the world a favor and start using the canonical URL LINK tag. More things than you realize depend on the simple principle of identifying a page by a unique URL, and it’s getting harder than you think.
It turns out that a web without canonical URLs is like a database without primary keys.
I recently wrote a web crawler (yes, seriously), and I was forcefully introduced to how vague URLs can be. The idea that a unique page of content has a single URL is laughably naïve.
Turns out that the absolute hardest part of writing a crawler is “normalizing” URLs – taking two URLs, and trying to figure out if they’re actually addressing the same resource. Fact is, you can address a page of content in more ways than you think.
Some examples —
You need to account for SSL vs. non-SSL. A lot of sites will accept inbound requests for both “http” and “https” to the same URL, and return the same page of content. This technically results in two separate URLs, and if a crawler is cataloguing URLs, it needs to account for the fact that this is really the same page of content, even if the bytes of the URL differ.
Now, that one isn’t too hard. There aren’t many pages that differ remarkably if they’re secure or not. But what about domain? Your website could respond to multiple domains. It could be as simple as the same content coming up under “www.gadgetopia.com” and “gadgetopia.com”, or as complex as hundreds of different domain names generating the same pages.
It gets worse – what about querystring arguments? The fact is that different arguments have different degrees of import. Some are critical in determining the content of the page (“article_id”) and others really only matter to humans interacting with the page (“return_page”). There’s a whole bucket of querystring arguments that really have no effect on the core content being returned to the user agent.
(URL arguments for analytics are especially bad. Click a link out of a Feedburned blog post, and you end up with “utm_source” and “utm_medium” as querystring arguments, neither of which has any bearing on the actual content of the page returned.)
Differing capitalization could technically result in different pages too (although this would be terribly bad form…)
I could go on and on about URL vagaries, but just understand that this URL —
— and this URL —
— may return the exact same page of content, but I have no way of knowing this.
On a known site (a site I own or am crawling for a client), I can make some rules, like always knowing that I should swap “domain.com” for “www.domain.com,” but if I’m doing a crawl of a site I have no connection with (a “hostile” crawl?), then I just have to assume those two URLs are actually two separate pieces of content and index them as different pages, even though “article_id=5” probably indicates they return the same thing.
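For a site you do know, rules like these might be sketched in Python as follows. Everything here is an assumption for illustration: the list of ignorable arguments, the host alias table, the choice to force “https”, and the function name normalize are all hypothetical, and a real crawler would need a rule set per site.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Hypothetical rules for ONE known site. On an unknown site you cannot
# safely assume any of this.
IGNORED_PARAMS = {"utm_source", "utm_medium", "return_page"}
HOST_ALIASES = {"domain.com": "www.domain.com"}


def normalize(url: str) -> str:
    """Reduce a URL to the form this site's rules treat as the same page."""
    parts = urlsplit(url)

    # Host names are case-insensitive, so lowercase them, then apply the alias.
    # Note: .hostname drops any port or user info, which is fine for this sketch.
    host = (parts.hostname or "").lower()
    host = HOST_ALIASES.get(host, host)

    # Keep only arguments that affect content, sorted so order does not matter.
    query = sorted(
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in IGNORED_PARAMS
    )

    # Assumption: this site serves the same content over http and https.
    # The path is left untouched because capitalization there may matter.
    return urlunsplit(("https", host, parts.path or "/", urlencode(query), ""))


a = normalize("http://domain.com/article?utm_source=feed&article_id=5")
b = normalize("https://WWW.domain.com/article?article_id=5&return_page=home")
print(a == b)  # prints True
```

The weak point is the rule table itself: it only works because someone who knows the site decided which arguments are safe to throw away.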
And none of this takes into account the new world of visitor segmentation and anonymous personalization. If you live in California, you might get a different page than if you live in New York. So where is your crawler coming from, and how is it ever going to emulate someone from somewhere else?
(For a while, I tried to abandon URLs and hash the actual HTML returned, then compare the hashes. This would tell me, more clearly, if this page is unique. But that too is problematic for a number of reasons – sometimes querystring arguments, for instance, change the page in tiny, effectively meaningless ways, but ways which result in an entirely different hash.)
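To see why hashing is so brittle, here is a minimal sketch using SHA-256 from Python's standard library. The two HTML strings are made up for the example, and the choice of SHA-256 is an assumption; any cryptographic hash behaves the same way.

```python
import hashlib


def page_hash(html: str) -> str:
    """Hash the raw HTML of a page (hypothetical helper for this example)."""
    return hashlib.sha256(html.encode("utf-8")).hexdigest()


# Two invented responses for the same article. The only difference is a
# "back" link driven by a querystring argument.
page_a = "<html><body><p>Article 5</p><a href='/home'>Back</a></body></html>"
page_b = "<html><body><p>Article 5</p><a href='/search'>Back</a></body></html>"

print(page_hash(page_a) == page_hash(page_b))  # prints False
```

A few changed bytes produce a completely different hash, so the comparison says “different page” even though a human would call them the same article.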
This is where canonical URLs help. For each page of content, have a canonical LINK tag which indicates the one true URL this should be accessed under anonymously. Here’s Google’s page about them, and here’s what one of them looks like.
<link rel="canonical" href="http://example.com/123"/>
It’s not just crawlers that depend on this — any site which needs to tell one URL from another would benefit from this. If you submit a URL to Reddit, it checks to see if it’s been submitted already. To do this, it depends on the fact that the URL has some consistency.
If you are writing software that somehow keys off a URL, look for a canonical LINK tag and use it if you find it. By including it, the site owner is doing you a massive favor. Don’t ignore it.
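Looking for the tag does not take much. Below is a rough sketch using only Python's standard library; the class name CanonicalFinder and the function canonical_url are hypothetical, and it assumes you already have the page's HTML and the URL you fetched it from. If no canonical tag is present, it falls back to the fetched URL.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class CanonicalFinder(HTMLParser):
    """Remembers the href of the first <link rel="canonical"> it sees."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.canonical is not None:
            return
        attributes = dict(attrs)
        # rel can hold several space-separated values, in any case.
        rels = (attributes.get("rel") or "").lower().split()
        if "canonical" in rels and attributes.get("href"):
            self.canonical = attributes["href"]


def canonical_url(html: str, fetched_url: str) -> str:
    """Return the page's declared canonical URL, or the fetched URL if none."""
    finder = CanonicalFinder()
    finder.feed(html)
    if finder.canonical is None:
        return fetched_url
    # The href may be relative, so resolve it against the URL we fetched.
    return urljoin(fetched_url, finder.canonical)


html = '<html><head><link rel="canonical" href="http://example.com/123"/></head></html>'
print(canonical_url(html, "http://example.com/123?utm_source=feed"))
# prints http://example.com/123
```

Whatever your software uses as its key (a crawl index, a list of submitted links), keying on the result of something like canonical_url instead of the raw URL means two differently decorated links to the same page collapse to one entry.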
Using a canonical URL is like declaring a primary key on your content. You are saying, effectively, that “no matter how you actually got to this page of content, this URL is the official URL for this page and should be used when discussing this page.”
The web will be a better place for it.