I changed the URL scheme of this Web site over the weekend. I had been meaning to do it for a while, but some problems with Movable Type 3.2 kind of forced the issue. (I have got to stop rushing into every beta that presents itself…)
To make everything backwards compatible, I built a simple redirect system — I have a table in the database with every single permalink from the old site (all 9,000 of them — including entry RSS feeds and category pages) mapped to every single new URL.
If someone looks for a page which has moved, the 404 page does a lookup on this table, “resolves” the old URL against a new one, then redirects with a “301 Moved Permanently.” It seems to work well.
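For illustration, the lookup could be sketched like this in Python. This is only a sketch of the idea, not the code running on this site: the function name, the in-memory dictionary standing in for the database table, and the new URL are all made up for the example.

```python
from typing import Dict, Optional, Tuple

# Hypothetical sample row. The real table maps about 9,000 old permalinks,
# and lives in a database rather than in a dictionary.
REDIRECTS: Dict[str, str] = {
    "/archives/000355.html": "/new/location-of-entry-355",
}


def handle_missing(path: str, table: Dict[str, str]) -> Tuple[int, Optional[str]]:
    """Decide what the 404 page should answer for a path that was not found.

    Returns (301, new_url) when the old path is in the lookup table,
    otherwise (404, None) for a path the table knows nothing about.
    """
    new_url = table.get(path)
    if new_url is not None:
        return 301, new_url  # "301 Moved Permanently" plus a Location header
    return 404, None


print(handle_missing("/archives/000355.html", REDIRECTS))
print(handle_missing("/no/such/page.html", REDIRECTS))
```

The second call is the interesting case: anything that falls through to the plain 404 is a path the table cannot resolve.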
A side benefit of this system is that I can watch for “unresolved 404s,” meaning 404s that were not in my lookup table — a genuine 404, if you will. I’ve noticed some interesting phenomena:
I get hammered by referrer spam. We’ve talked about this before: this is spam created by a bot hitting any page with a fake referrer string in the hopes that you’re displaying your referrers on your site (a la Dean Allen’s Refer or a similar tool).
This results in fully half the unresolved 404s on this site coming from casino bots hitting URLs that are three years old. I know they’re that old because they use the very first URL scheme I had for this site — the default Movable Type archives URL: “archives/000355.html”, etc.
They must be working off a very old list of URLs, which I find quite funny, and quite interesting. Why would they keep an old list of URLs lying around? Why not just re-spider? Do spammers sell lists of URLs like they do lists of emails?
Browsers and spiders sometimes mangle HREFs. I see impossible URLs that can only result from a misinterpretation of the HREF in the link. IE 5.x on the Mac, for example, has problems with background images coded in CSS. I see that browser try to get this a lot:
It's just mangled the URL of the image.
Others, however, are more mysterious. Just two minutes ago, a spider tried to access a URL that it could only have hit if it missed the leading "/" in the HREF. Coming from this page...
…the spider tried to hit:
I just checked that page and there’s no way it pulled that URL out of the code. The correct URL was…
But the URL it requested could only have come about if the spider had a bug of some kind or if the HTML got mangled on the way down.
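To see what a dropped leading "/" does, here is a small Python sketch using the standard library's `urljoin`. The page and image URLs are placeholders, not the actual URLs from the logs.

```python
from urllib.parse import urljoin

# Hypothetical page the spider was reading when it found the link.
page = "http://www.example.com/archives/2005/08/some-entry.html"

# With the leading slash, the HREF resolves against the site root.
correct = urljoin(page, "/images/logo.gif")

# Without it, the same HREF resolves against the page's own directory.
broken = urljoin(page, "images/logo.gif")

print(correct)  # http://www.example.com/images/logo.gif
print(broken)   # http://www.example.com/archives/2005/08/images/logo.gif
```

A client that loses the slash ends up asking for a path nested under the referring page's directory, which is the kind of impossible URL that shows up as an unresolved 404.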
I also get hits to things like this:
No mystery here — that’s just a truncated version of this:
Truncation, it seems, happens a lot. The Ask Jeeves/Teoma spider, for instance, has been trying all day to get at URLs that are all truncated at 39 characters. Add “http://www.gadgetopia.com/” to that, and you get 64 characters.
Why is this, I wonder? Was that the size of the database field they stored the URL in? More importantly, does it explain why I’ve never done so well in that index? I’m wondering now if my previously-long URLs have hurt my engine placement in other indexes besides Google.
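One way to spot truncation in a pile of unresolved 404s is to check whether a missing path is a proper prefix of a URL that does exist, then look at how long the truncated paths are. The sketch below assumes you have both lists in hand. The function name and sample paths are invented for the example.

```python
from collections import Counter
from typing import Dict, Iterable, List


def find_truncated(missing: Iterable[str], known: List[str]) -> Dict[str, List[str]]:
    """Map each missing path to the known paths it is a proper prefix of."""
    matches: Dict[str, List[str]] = {}
    for path in missing:
        full = [k for k in known if k.startswith(path) and len(k) > len(path)]
        if full:
            matches[path] = full
    return matches


# Hypothetical data: one real URL, one request cut off at 39 characters,
# and one request that matches nothing at all.
known_paths = ["/archives/2005/08/a-very-long-entry-title-goes-here.html"]
missing_paths = [known_paths[0][:39], "/no/such/page.html"]

truncated = find_truncated(missing_paths, known_paths)

# If one length dominates this tally, the client is probably cutting URLs
# off at a fixed size.
length_tally = Counter(len(path) for path in truncated)
print(truncated)
print(length_tally)
```

A tally that piles up on a single length, as with the Teoma requests, points at a fixed cutoff on the client's side rather than random damage.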
As implied by the preceding two points, the vast majority of 404s are from bots. I’m sure this is true for all sites, but I never realized it so much until now.
Hack attempts abound. There are lots of attempts to hit DLL files in the (nonexistent) “MSOffice” and “_vti/” directories. These are people trying to hack Outlook Web Access and various Web-enabled Microsoft Office technologies.
Spiders don’t crawl and index in the same pass. I changed the URL pattern late Friday night, then changed my mind about which pattern to use when I woke up the next morning. This means the site was accessible under a certain pattern for about eight hours.
In the following 48 hours, I saw attempt after attempt by bots to get to files under that pattern. This tells me that a crawler made a pass at the site during that eight-hour window and stored the URLs it found. Then an indexer used that list to come back through the site a day later and index the text (sadly, in this case, the pattern had changed; I’ve since put in a RewriteRule to catch those).