The Peril of Self-Replicating Hyperlinks

May 2, 2008

I built an intranet for a client. One of the functional items is a viewer into an Exchange calendar. We use a handy third-party component to display the contents of an Exchange public folder on a page.

The month and year to be viewed are driven by the querystring. Something like:

/month.aspx?m=11&y=2010

So you can look at any month by writing your own querystring. We check for valid input and everything, but so long as you entered a valid month and year in the querystring, you could look up any logical month in existence, as far ahead or behind as you wanted.

Each month has helpful “Next” and “Previous” links on it that form the URL for the next or previous month.

Sadly, we’re also indexing the intranet via a Google Mini.

Astute readers will see the problem here…

Two things happened:

  1. The number of pages in the Mini spiked. The client was suddenly hitting their document limit. They only had about 10,000 actual pages of content, but the Mini was claiming it had indexed four or five times that number.

  2. We started to get reports about odd months being returned in search results. Months like “November 2609” for example…

The Mini’s crawler, bless its heart, was dutifully following the “Next” and “Previous” links in the calendar into infinity in either direction. It was, in effect, inventing its own URLs…forever. Every new page in the calendar gave it a new URL it hadn’t seen before. The Mini’s crawler had fallen down the rabbit hole.
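To make the rabbit hole concrete, here is a minimal Python sketch of what the crawler was effectively doing. It is an illustration, not the Mini's actual logic: `month_url`, `shift_month`, `crawl`, and the `page_budget` parameter are all hypothetical names, and the real site was an ASP.NET page rather than Python.

```python
from collections import deque


def month_url(month: int, year: int) -> str:
    """Build a calendar URL in the same shape as the example above."""
    return f"/month.aspx?m={month}&y={year}"


def shift_month(month: int, year: int, delta: int) -> tuple[int, int]:
    """Return the (month, year) that is `delta` months away."""
    index = year * 12 + (month - 1) + delta
    return index % 12 + 1, index // 12


def crawl(start_month: int, start_year: int, page_budget: int) -> set[str]:
    """Follow Previous/Next links breadth-first until the budget runs out.

    Hypothetical crawler: with unbounded links the queue never empties,
    so the page budget is the only thing that stops the loop.
    """
    seen: set[str] = set()
    queue = deque([(start_month, start_year)])
    while queue and len(seen) < page_budget:
        month, year = queue.popleft()
        url = month_url(month, year)
        if url in seen:
            continue
        seen.add(url)
        # Every page advertises two neighbours, one of which is always new.
        queue.append(shift_month(month, year, -1))
        queue.append(shift_month(month, year, +1))
    return seen
```

However large you make `page_budget`, `crawl` fills it: the calendar never runs out of unseen months, so the only limit is the crawler's own document cap.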

Easy problem to fix, but an embarrassing oversight nonetheless. We now drop the “Next” and “Previous” links at 24 months out in either direction, and we throw a 410 for anything outside those bounds in the past, and a 404 for anything outside those bounds in the future.
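The fix can be sketched roughly as follows, again in Python rather than the site's actual ASP.NET code. I am assuming here that the 24-month window is measured from the current month; `WINDOW_MONTHS`, `status_for`, and `nav_links` are hypothetical names, and the querystring validation is left out.

```python
from datetime import date

WINDOW_MONTHS = 24  # assumed: bounds are counted from the current month


def month_index(month: int, year: int) -> int:
    """Number the months consecutively so they can be subtracted."""
    return year * 12 + (month - 1)


def status_for(month: int, year: int, today: date) -> int:
    """HTTP status for a requested month: 410 too far back, 404 too far ahead."""
    offset = month_index(month, year) - month_index(today.month, today.year)
    if offset < -WINDOW_MONTHS:
        return 410  # Gone: outside the bounds in the past
    if offset > WINDOW_MONTHS:
        return 404  # Not Found: outside the bounds in the future
    return 200


def nav_links(month: int, year: int, today: date) -> dict[str, str]:
    """Return only the Previous/Next links that stay inside the window."""
    links = {}
    for label, delta in (("Previous", -1), ("Next", 1)):
        index = month_index(month, year) + delta
        m, y = index % 12 + 1, index // 12
        if status_for(m, y, today) == 200:
            links[label] = f"/month.aspx?m={m}&y={y}"
    return links
```

With this in place a crawler that starts inside the window runs out of new links after at most 49 months (24 either side plus the current one), and any out-of-bounds URL it already holds gets an error status on the next recheck.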

I just checked today, and the number of pages in the Mini came down by 2,000 yesterday, as it rechecks out-of-bounds URLs and gets back 410s and 404s.

I wonder how many sites on the public Internet have this same problem? I wonder if crawlers have any logic to detect this?

###

Comments

  1. Damion says:

    I've seen this happen with Joomla installs and calendar components. The Google web crawler will follow those links until your hosting account's bandwidth is used up. We've had two customers with that issue. It was a quick fix using an htaccess file and disallowing indexing on that component.

    Either way, we learned to be a little more careful with calendars... just as you have!