Fighting Content Rot

By Deane Barker on July 10, 2004

If you manage a Web site for more than a few months, you run into problems of content rot. You’ll be cruising through some old pages, and you’ll find stuff that’s…off, for one reason or another.

For instance, when this blog first started, I was anal-retentive about enclosing BLOCKQUOTEd text in quotes. It was a quote, after all. I would go through all the text I quoted, find double quotes, convert them to singles, then surround the entire thing in double-quotes before BLOCKQUOTEing the entire thing.

Now, this was very admirable of me, but when I started inviting others to blog with me, that whole concept broke down. Not everyone was doing it, and since it wasn’t consistent, I didn’t want to do it at all. However, there are still a thousand or so entries sitting out there with quotes around them.

Just recently, we started to standardize the code fragments we post by using the CODE tag and the SimpleCode script. There remain, however, a hundred or so posts with code hacked up in BLOCKQUOTEs or DIVs or God knows what.

These aren’t isolated cases: there are styles that we’ve since abandoned, double-dashes that haven’t been replaced with the &mdash; entity, etc. I try to nail these things as entries hit the site, but I miss some. On top of all this, throw in link rot (links that just 404 over time) and comments. Ugh, comments…

I try to stay on top of comment spam, but I’m sure some get through. Additionally, there are stupid comments that slip by (why do people insist on testing my comment form with ‘fgfgfgfgfgf’ all the time?), and comments that aren’t relevant any longer — people complaining about bad links that I’ve fixed or mis-spellings that I’ve corrected.

Categorization is another thing. I added the Temple of Mac category at about entry #1,600. However, I didn’t bother to go back through all the old entries and move all the Mac-related entries to the new category.

Mix all this together, and you have a site that doesn’t really age well. I’m sure if I tooled through 100 old entries, I’d have something that needed to be fixed or corrected in at least 40 of them. How do you handle this? Gadgetopia is hurtling toward entry number 3,000, and that’s a lot of volume.

I’ve often thought that I should create a script that just generated 10 random entries a day for me to review. Each morning, I’d get an email with 10 entries in it that I need to look over and touch up. But how do you make sure you get them all before you start getting duplicates? I suppose you could log them all in a table and then join the entries table against it to filter out entries that had already been covered. Like this:

SELECT e.entry_id FROM mt_entries e LEFT JOIN already_reviewed r ON e.entry_id = r.entry_id WHERE r.entry_id IS NULL ORDER BY RAND() LIMIT 10

(I haven’t tested this SQL, mind you.) Wrap some PHP around this, schedule it for the middle of the night, and you’d have 10 entries every morning that you can tune up. Perhaps I’d send 10 to myself, and three or so to each of the rest of the authors.
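As a rough sketch of that mailer's core, here is the same anti-join wrapped in Python rather than PHP, run against an in-memory SQLite database so it can be tried without a live install. Several things here are assumptions, not part of the original idea: the `entry_id` column on `already_reviewed`, the stripped-down table definitions, the helper names `pick_entries_for_review` and `build_demo_db`, and the use of SQLite's `RANDOM()` where MySQL would use `RAND()`. Sending the email is left out.

```python
import sqlite3


def pick_entries_for_review(conn, batch_size=10):
    """Return up to batch_size unreviewed entry IDs and log them as reviewed."""
    # Anti-join: keep only entries with no matching row in already_reviewed.
    # RANDOM() is the SQLite spelling; MySQL would use RAND().
    rows = conn.execute(
        """
        SELECT e.entry_id
        FROM mt_entries e
        LEFT JOIN already_reviewed r ON e.entry_id = r.entry_id
        WHERE r.entry_id IS NULL
        ORDER BY RANDOM()
        LIMIT ?
        """,
        (batch_size,),
    ).fetchall()
    entry_ids = [row[0] for row in rows]
    # Log the batch so the next run cannot return the same entries.
    conn.executemany(
        "INSERT INTO already_reviewed (entry_id) VALUES (?)",
        [(entry_id,) for entry_id in entry_ids],
    )
    conn.commit()
    return entry_ids


def build_demo_db(entry_count=25):
    """Create an in-memory database with hypothetical, minimal tables."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE mt_entries (entry_id INTEGER PRIMARY KEY)")
    conn.execute("CREATE TABLE already_reviewed (entry_id INTEGER PRIMARY KEY)")
    conn.executemany(
        "INSERT INTO mt_entries (entry_id) VALUES (?)",
        [(i,) for i in range(1, entry_count + 1)],
    )
    conn.commit()
    return conn
```

Calling `pick_entries_for_review(build_demo_db())` repeatedly hands out every entry exactly once and then returns an empty list, which answers the duplicates question: once the list comes back empty, the whole archive has been covered.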

I think, however, I’m going to try something different. I’m on the verge of putting another sidebar on the front page called “One Year Ago Today” that lists the things we were talking about a year ago (see the OnThisDay plugin). I’ll schedule an automatic rebuild of the front page every morning at 1:00 a.m., then check the year-old entries while I’m eating my Crunchy Corn Bran in the morning.
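The date logic behind a sidebar like that is small enough to sketch. This is not how the OnThisDay plugin works internally (I haven't looked); it is only an illustration in Python, and the function names and the `(title, published_date)` pair format are made up for the example. The one wrinkle is February 29, which has no counterpart in the previous year, so the sketch falls back to February 28.

```python
from datetime import date


def one_year_before(today):
    """Return the same calendar day one year earlier."""
    try:
        return today.replace(year=today.year - 1)
    except ValueError:
        # Feb 29 has no counterpart in a non-leap year; fall back to Feb 28.
        return today.replace(year=today.year - 1, day=28)


def entries_one_year_ago(entries, today):
    """Return titles of (title, published_date) pairs posted a year before today."""
    target = one_year_before(today)
    return [title for title, published in entries if published == target]
```

A nightly rebuild would call `entries_one_year_ago(all_entries, date.today())` and render the resulting titles into the sidebar.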

Maybe this will work, maybe it won’t. If someone wants to take a stab at the mailer script (or if you already have), please post a link. If anyone else has any thoughts about content rot, let’s hear them.

  1. I saw a show today talking about this self-inflating travel doggie bed that I could buy, so that when I’m taking a road trip, my dog has somewhere to sleep, and it’s so nice and compact that I can bring it on every trip.

    What does this have to do with content rot? I forgot.

    Wait, here it is. My dog can sleep on the floor. He does it at home. He’s a dog. If he gets cold, he sleeps on the heater vent (so I get cold), if he’s hot, he sleeps on the linoleum.

    The travel doggie bed is pointless. A lot of effort and time was put into something that really was of marginal value.

    I look at content rot that way (although to a lesser degree). I can see cleaning up broken links, but why does it matter if a dash is a hyphen or an emdash? Are we ever going to parse all the content and tokenize it by dashes?

    To some degree, I look at most gadgetopia posts as somewhat content rotted just based on their content. If we post a cool tip, a book review, or an insightful look at something, that may have long term value, but if we post about a planned feature for Longhorn, that post will be worthless the minute Longhorn rolls out. If I post that a new version of some software is out, or some new product, then that’s very time-sensitive, so that post is probably worthless in about a week. I’d say as long as it’s legible, leave it alone, since you’re unlikely to get any return at all on the time invested.

    OK, enough navel inspection.

  2. Sure, for this site, the only justification is my anal-retentive-ness. However, think about something like, say, MSDN. They have thousands upon thousands of pages. How do they make sure that everything on those pages is in good shape? How do you make sure you’re adequately maintaining 10,000 pages of content?

    We’ve all had sites with pages that just kind of fall through the cracks. You’re cruising around your company site, and you stumble on a “Services” page that obviously hasn’t been touched in years that mentions services you don’t even provide anymore.

    What processes do you put in place to make sure that every page on your brochureware site is reviewed every 60 days, for example?

  3. I don’t think that MSDN has any magic bullet in this department. I’m constantly running across broken links in their docs. It used to be that you’d see old page layouts as well, but I think they’ve probably since upgraded to some content-managed affair for the layouts. I think it’s one of those things though, where you fix the things that people have problems with. If no one notices it, it’s not a problem.

  4. But if you ran across the broken link, then you noticed it. I don’t like that whole, “let it decay and just fix it if somebody complains” theory. That bugs me.

  5. I have no doubt that it bugs you – I’m not terribly fond of it myself. But the reality is, what’s the return on investment? If I spend 100 hours keeping old content up to date, and only 1 person ever looks at 1 page of it, that’s a pretty low value proposition.

    Maybe a smarter approach would be to watch the traffic stats, find out which old pages are still being viewed, and prescribe a level of maintenance for them based on their popularity.

  6. Are there any industry best practices for this kind of thing? I have been tasked to look and see if there are any; I find a lot about CMS and ECM, but nothing at a high level that is more process related. Does anyone know what is out there?
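The traffic-based idea from comment 5 could be sketched along these lines. Everything specific here is an assumption for illustration: the function name, the shape of the input (a mapping from page path to recent view count, however your stats package exports it), the two thresholds, and the tier labels.

```python
def maintenance_plan(page_views, high=1000, low=50):
    """Map each page to a review tier based on its recent view count."""
    plan = {}
    for page, views in page_views.items():
        if views >= high:
            # Popular pages earn regular attention.
            plan[page] = "review monthly"
        elif views >= low:
            plan[page] = "review yearly"
        else:
            # Nearly unread pages only get fixed when someone reports a problem.
            plan[page] = "fix only on complaint"
    return plan
```

The thresholds are the whole policy, so they are parameters: a brochureware site and a 3,000-entry blog would pick very different numbers.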
