I’ve been monitoring the 404s on this site. I changed our URL pattern a while back, so I have a page that catches all the 404s, resolves the old pattern against the new one, and redirects. Anything that doesn’t resolve gets logged, and I have an RSS feed where I can watch them all.
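A minimal sketch of that kind of catch-all, in Python. This is not this site's actual code: the old and new URL shapes (`/archives/000123.html` and `/entry/123/`), the function names, and the in-memory log are all hypothetical, and answering with a 301 rather than some other redirect status is an assumption.

```python
import re
from typing import Dict, List, Optional, Tuple

# Hypothetical old pattern: /archives/000123.html
# Hypothetical new pattern: /entry/123/
OLD_PATTERN = re.compile(r"^/archives/0*(\d+)\.html$")


def resolve_old_url(path: str) -> Optional[str]:
    """Return the new-style URL for an old-style path, or None if it doesn't match."""
    match = OLD_PATTERN.match(path)
    if match is None:
        return None
    return f"/entry/{match.group(1)}/"


def handle_not_found(path: str, unresolved_log: List[str]) -> Tuple[int, Dict[str, str]]:
    """Decide the status code and headers for a request that matched no real page."""
    new_url = resolve_old_url(path)
    if new_url is not None:
        # Assumption: a permanent redirect, so well-behaved clients update their records.
        return 301, {"Location": new_url}
    # Nothing resolved: record the path for later review and send a real 404.
    unresolved_log.append(path)
    return 404, {}
```

In a real setup the log would go to a file or database that feeds the RSS feed; a plain list stands in for it here so the logic is easy to follow.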
Which brings me to my point: Web spiders are pretty stupid. Ninety-nine percent of 404s to this site are from spiders. They’re looking for URLs that:
I’ve also noticed a lot of one-off spiders that I’ve never seen before. They come out of colleges a lot, it seems.
And, of course, there are hack attempts galore. Trying to hack the XMLRPC vulnerability that was revealed a few months ago is pretty common, and I get scads of long, long requests for things in “_vti” directories.
That said, monitoring your 404s is a really handy thing to do as it alerts you to a lot of problems. We have over 4,500 entries now, and by watching bad requests, I find out all the time about bad links, missing images, etc. It’s really a good, simple way to give you an extra leg up on fighting content rot.
But don’t think the spiders are the smart ones. You’d think since they were programmed by (supposed) professionals, and have everything in a database somewhere, that they’d be pretty on top of things. My experience, however, indicates that a bunch of two-year-olds mashing on the keyboard would probably come up with more valid URLs than your average Web spider.
Comments
In a sense, isn’t that what a spider does? Except the file stored somewhere could well be a page somewhere else with a pile of dead links.
When you redirect, are you also sending a “301 Moved Permanently” header? On properly coded spiders and bots, I would think, this would help fight the revisitation problem.
At least some of the time, they aren’t being stupid; they are being too smart for their own good, as you match on truncations. Because so many people return custom error pages with a “200 OK” rather than a “404 Not Found” or whatever the actual error is, many spiders will try to see what your errors look like by forcing one: they compare the results from a good URL with those from one that they’ve intentionally broken.
Yep.