The Peril of Self-Replicating Hyperlinks

May 2

I built an intranet for a client. One of the functional items is a viewer into an Exchange calendar. We use a handy third-party component to display the contents of an Exchange public folder on a page.

The month and year to be viewed are driven by the querystring. Something like:

/month.aspx?m=11&y=2010

So you can look at any month by writing your own querystring. We validate the input, but as long as the querystring held a valid month and year, you could look up any logical month in existence, as far ahead or behind as you wanted.

Each month has helpful “Next” and “Previous” links on it that form the URL for the next or previous month.

Sadly, we’re also indexing the intranet via a Google Mini.

Astute readers will see the problem here…

Two things happened:

  1. The number of pages in the Mini spiked. The client was suddenly hitting their document limit. They only had about 10,000 actual pages of content, but the Mini was claiming it had indexed four or five times that number.

  2. We started to get reports about odd months being returned in search results. Months like “November 2609” for example…

The Mini’s crawler, bless its heart, was dutifully following the “Next” and “Previous” links in the calendar into infinity in either direction. It was, in effect, inventing its own URLs…forever. Every new page in the calendar gave it a new URL it hadn’t seen before. The Mini’s crawler had fallen down the rabbit hole.
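To make the runaway concrete, here is a small simulation in Python. It is only a sketch, not the Mini's real crawling logic or the site's ASP.NET code: the helper names (shift_month, month_url, crawl) and the max_pages cap are made up for illustration. The point it shows is that the crawl never runs out of new URLs on its own; the only thing that stops it is the cap.

```python
from collections import deque


def shift_month(year, month, delta):
    """Return (year, month) moved forward or back by delta months."""
    index = year * 12 + (month - 1) + delta
    return index // 12, index % 12 + 1


def month_url(year, month):
    """Build a calendar URL in the same shape as the example above."""
    return f"/month.aspx?m={month}&y={year}"


def crawl(start_year, start_month, max_pages):
    """Follow Previous/Next links breadth-first until max_pages URLs are seen.

    max_pages is a hypothetical stand-in for a document limit; without it
    this loop would never terminate, because every page links to a month
    the crawler has not visited yet.
    """
    seen = set()
    queue = deque([(start_year, start_month)])
    while queue and len(seen) < max_pages:
        year, month = queue.popleft()
        url = month_url(year, month)
        if url in seen:
            continue
        seen.add(url)
        # Every calendar page exposes exactly these two outgoing links.
        queue.append(shift_month(year, month, -1))
        queue.append(shift_month(year, month, +1))
    return seen


pages = crawl(2008, 5, max_pages=1000)
print(len(pages))
```

Whatever value you pass for max_pages, the crawl fills it, which is roughly what the Mini's document count was doing.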

Easy problem to fix, but an embarrassing oversight nonetheless. We now drop the “Next” and “Previous” links at 24 months out in either direction, and we throw a 410 for anything outside those bounds in the past, and a 404 for anything outside those bounds in the future.
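The rules above can be sketched like this. It is Python rather than the actual ASP.NET code, and it assumes the 24-month window is measured from the current month, which is one reading of the fix and not something spelled out here. The names MAX_MONTHS_OUT, status_for and nav_links are invented for the example.

```python
from datetime import date

MAX_MONTHS_OUT = 24  # assumed: window measured in months from the current month


def month_index(year, month):
    """Map a (year, month) pair to a single running month count."""
    return year * 12 + (month - 1)


def status_for(year, month, today):
    """Return the HTTP status a request for this month should get."""
    offset = month_index(year, month) - month_index(today.year, today.month)
    if offset < -MAX_MONTHS_OUT:
        return 410  # too far in the past: Gone
    if offset > MAX_MONTHS_OUT:
        return 404  # too far in the future: Not Found
    return 200


def nav_links(year, month, today):
    """Return only the Previous/Next links that stay inside the window."""
    links = {}
    for name, delta in (("previous", -1), ("next", +1)):
        index = month_index(year, month) + delta
        y, m = index // 12, index % 12 + 1
        if status_for(y, m, today) == 200:
            links[name] = f"/month.aspx?m={m}&y={y}"
    return links


today = date(2008, 5, 2)
print(status_for(2609, 11, today))  # a month far outside the window
print(nav_links(2010, 5, today))    # at the edge, only "previous" remains
```

Dropping the links keeps a crawler from discovering out-of-bounds months in the first place; the status codes clean up the ones it already found.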

I just checked today, and the number of pages in the Mini dropped by 2,000 yesterday, as it rechecks out-of-bounds URLs and gets back 410s and 404s.

I wonder how many sites on the public Internet have this same problem? I wonder if crawlers have any logic to detect this?


Comments

by Damion,   May 2, 2008 8:42 AM  

I've seen this happen with Joomla installs and calendar components. The Google web crawler will follow those links until your hosting account bandwidth is used up. We've had two customers that had that issue. It was a quick fix using an htaccess file and disallowing indexing on that component.

Either way, we learned to be a little more careful with calendars... just as you have!


