Spiders are Stupid

By Deane Barker on November 4, 2005

I’ve been monitoring the 404s on this site. I changed our URL pattern a while back, so I have a page that catches all the 404 and resolves the old pattern against the new one, then redirects. Anything that doesn’t resolve gets logged and I have an RSS feed where I can watch them all.
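
Roughly how such a catch-all might work, as a minimal sketch: the old and new URL patterns, the log file name, and the function name below are all hypothetical, since the post doesn’t show the actual code.

    import re

    # Hypothetical old and new URL schemes; the real patterns aren't given.
    OLD_PATTERN = re.compile(r"^/archives/(\d+)\.html$")
    NEW_TEMPLATE = "/post/{id}/"

    UNRESOLVED_LOG = "unresolved-404s.log"  # this file feeds the 404 RSS feed

    def handle_404(requested_path):
        """Map an old-style URL to the new scheme, or log the miss."""
        match = OLD_PATTERN.match(requested_path)
        if match:
            # A 301 tells well-behaved spiders the move is permanent.
            return ("301 Moved Permanently", NEW_TEMPLATE.format(id=match.group(1)))

        # Anything that doesn't resolve gets logged for the feed.
        with open(UNRESOLVED_LOG, "a") as log:
            log.write(requested_path + "\n")
        return ("404 Not Found", None)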

Which brings me to my point: Web spiders are pretty stupid. Ninety-nine percent of 404s to this site are from spiders. They’re looking for URLs that:

  • …they couldn’t possibly have derived from any other page on the site.
    Oftentimes they screw up relative vs. absolute URLs. I usually go check, just in case I forgot to put “http://” in front of something, but everything is invariably in order; it must just be the spider that’s confused.
  • …existed a long, long time ago.
    I still get spiders coming in for pages with URLs that haven’t been around for three years. They must have them stored somewhere because every once in a while I’ll get about 300 consecutive requests from the same spider for the same old pattern, like it was reading them from a file somewhere.
  • …are obviously munged.
    Spiders truncate a lot, or insert random spaces into URLs. I finally modified my lookup script to strip spaces from the target URL first and, if it still can’t find what they want, to match the request against the front of each known URL, so I can catch truncations. (A sketch of both fixes follows this list.)
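
To make that concrete, here is a minimal sketch of those two fixes, assuming a known_paths collection that holds every valid URL on the site; the function name and that collection are illustrative, since the post doesn’t show the actual script:

    def resolve_munged(requested_path, known_paths):
        """Recover a real URL from a request a spider has munged."""
        # First fix: spiders insert random spaces, so strip them out.
        cleaned = requested_path.replace(" ", "")
        if cleaned in known_paths:
            return cleaned

        # Second fix: spiders truncate, so treat the request as a prefix
        # and match it against the front of each known URL.
        candidates = [p for p in known_paths if p.startswith(cleaned)]
        if len(candidates) == 1:
            return candidates[0]
        return None  # still unresolved; log it for the 404 feed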

I’ve also noticed a lot of one-off spiders that I’ve never seen before. They come out of colleges a lot, it seems.

And, of course, there are hack attempts galore. Trying to exploit the XMLRPC vulnerability that was revealed a few months ago is pretty common, and I get scads of long, long requests for things in “_vti” directories.

That said, monitoring your 404s is a really handy thing to do, as it alerts you to a lot of problems. We have over 4,500 entries on this site now, and by watching bad requests, I find out all the time about bad links, missing images, etc. It’s a good, simple way to get an extra leg up on fighting content rot.

But don’t think the spiders are the smart ones. You’d think that, since they were programmed by (supposed) professionals and have everything in a database somewhere, they’d be pretty on top of things. My experience, however, indicates that a bunch of two-year-olds mashing on the keyboard would probably come up with more valid URLs than your average Web spider.


Comments

  1. They must have them stored somewhere because every once in a while I’ll get about 300 consecutive requests from the same spider for the same old pattern, like it was reading them from a file somewhere.

    In a sense, isn’t that what a spider does? Except the file stored somewhere could well be a page somewhere else with a pile of dead links.

  2. When you redirect, are you also sending a “301 Moved Permanently” header? On properly coded spiders and bots, I would think, this would help fight the revisitation problem.

  3. At least some of the time, they aren’t being stupid; they’re being too smart for their own good when you match on truncations: because so many people return custom error pages with a “200 OK” rather than a “404 Not Found” (or whatever the actual error is), many spiders will try to see what your errors look like by forcing one, comparing the results from a good URL with one they’ve intentionally broken.
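
A minimal sketch of the probe described in the comment above, with illustrative names (the random path just has to be a URL that can’t exist):

    import urllib.error
    import urllib.request
    import uuid

    def has_soft_404s(base_url):
        """Request a URL that can't exist and see whether it still returns 200."""
        probe = base_url + "/" + uuid.uuid4().hex  # intentionally broken URL
        try:
            response = urllib.request.urlopen(probe)
            # A 200 on a nonsense URL means errors come back as "200 OK",
            # so the spider must compare bodies to tell real pages from errors.
            return response.getcode() == 200
        except urllib.error.HTTPError:
            # A real 404 (or similar) means the site reports errors honestly.
            return False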
