Intra-link Management in Content Management

February 25, 2012

Linking pages in a CMS can be tricky.  It helps to understand why.

Say some text in Page A links to Page B, both of which are in the same CMS. This link is buried in some HTML, deep in a WYSIWYG field of Page A.  What URL do you use to link to Page B?

Well, you could just use Page B’s public URL, right?  Sure, let’s just put that as the HREF in the A tag, and bake that into the HTML that we store.  That’ll work just fine.

Actually, not so much.  What happens if Page B gets deleted?  How will we know that there’s an outstanding link from Page A?  What if the URL for Page B changes – either Page B moves and this changes the URL, or some marketer decides she likes dashes better than underscores?  Now we’re screwed – we have a broken link, and worse, we have no way of finding it without stumbling across it.

(This is not at all an academic problem.  Gadgetopia has switched URL schemes twice in its 10-year history.  I still have a big lookup table of old-URL-to-new-URL pairs.)

The simple fact is that a link between content is two things.  It’s of course the actual HREF tag that visitors will click on.  But, in a larger sense, it’s also a conceptual relationship between content.  This relationship transcends the idea of what the link is – an HREF, usage of an image, whatever – and instead represents the basic idea that Page A relies on Page B for something, and if Page B were to change, this may have ramifications for Page A.  This is an important thing to know.

When you link between two pages in a CMS, your CMS really needs to do two things:

  1. Make the link durable, meaning it survives changes to the target URL.
  2. Make the link discoverable, meaning we have some way of finding out that Page A is depending on Page B in some way.

Making the link durable is a little tougher than you might think.  The obvious solution is that you link to some identifier for Page B, rather than to Page B itself.  This is the value that gets hard-baked into the HTML that’s stored in the repository.  This is the simple part.
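As a sketch of the idea – the `cms://` identifier scheme and the lookup table here are invented for illustration; a real CMS uses its own conventions:

```python
# Hypothetical sketch: store links by content ID, not by URL.
# The HTML saved in the repository uses a stable identifier scheme
# (the "cms://" prefix is made up for illustration).

def make_durable_href(content_id):
    """Return the identifier-based href that gets baked into stored HTML."""
    return f"cms://{content_id}"

def resolve_href(href, url_table):
    """Swap the identifier for the page's current URL at render time."""
    if href.startswith("cms://"):
        content_id = int(href[len("cms://"):])
        return url_table.get(content_id, "/404")
    return href  # external links pass through untouched

urls = {356: "/articles/link-management"}
stored = make_durable_href(356)      # "cms://356"
print(resolve_href(stored, urls))    # "/articles/link-management"
```

If Page B’s URL changes, only the lookup table changes – the stored HTML never has to be touched.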

This consequently raises the question: when do you do the replacement?  When do you swap out the identifier of Page B for the actual, current URL of Page B?  I’ve seen a couple of options.

Ektron never did.  When you linked to Page B from Page A, Ektron stored a redirection URL – something like:

/workarea/linkit.aspx?id=356

When a visitor hit that link, they got sent to a page which read in the ID, looked up the correct URL, and bounced them over to the current location of Page ID #356.  This is the simplest method, because it requires no processing of the HTML – the HREF that’s stored is the same one that gets output to the page and the same one the visitor requests; they just get redirected from there.
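A minimal sketch of that redirect pattern – the table and handler name are invented here; Ektron’s actual implementation is an ASPX handler:

```python
# Hypothetical sketch of an Ektron-style redirect endpoint (an analogue
# of linkit.aspx): read the ID off the query string, look up the page's
# current URL, and bounce the visitor over with a redirect.

url_table = {356: "/articles/page-b"}   # id -> current URL, maintained by the CMS

def linkit(query_id):
    """Return an HTTP (status, Location) pair for the redirect."""
    try:
        target = url_table[int(query_id)]
    except (KeyError, ValueError):
        return (404, "/not-found")
    return (302, target)

print(linkit("356"))   # (302, '/articles/page-b')
```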

There are a couple disadvantages, however.  First, every internal link ends up being the same, so users can’t mouseover a link and look in the status bar to see where they’re going.  (How many users do this?  I have no idea, but I do it myself a lot.)  Additionally, purists just find this redirection idea…icky.

Additionally, there may be some drawback with search indexing.  I had a very SEO-focused client who hated the “One URL to Rule Them All” method, so they manually entered and maintained “pure” URLs in an Ektron installation to avoid it.  I don’t recommend going to this trouble for SEO, however, as the value of self-referential PageRank is generally assumed to be negligible. (I have no idea whether an internal search tool like the Google Search Appliance tracks self-referential PageRank, or whether or not it has some method of matching up redirects to their destination pages.  It’s possible.)

(Also, whenever I discuss Ektron, I have to mention that my knowledge of their system is a couple years old.  Things may have changed since I worked with it last.)

EPiServer, on the other hand, stores the link as an identifier, and then does a very late swap – I believe as late as a Response Filter which filters the HTML as it leaves the server.  So, they pick through the HTML, finding every hyperlink, checking if it’s an identifier link to an internal resource, and then “fixing” the HTML.

Now, this method is super-clean – this delivers HTML with links that were correct at the instant the response leaves the server.  However, it’s resource-intensive.  EPiServer has stated that this operation takes between 5% and 20% of the entire computational load of a page request.  Consequently, they have written their own optimized HTML parser to streamline it as much as possible, and now they’re rewriting it to make it even faster.
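A rough sketch of a late-swap filter, assuming the same invented `cms://` identifier scheme as above – a production filter like EPiServer’s uses a far more optimized parser than a regex pass:

```python
import re

# Hypothetical sketch of a late-swap response filter: just before the
# HTML leaves the server, rewrite identifier links (the "cms://<id>"
# scheme is invented) into the target page's current URL.

url_table = {356: "/articles/page-b"}

def swap_links(html):
    def fix(match):
        content_id = int(match.group(1))
        return f'href="{url_table.get(content_id, "/404")}"'
    return re.sub(r'href="cms://(\d+)"', fix, html)

page = '<a href="cms://356">Page B</a>'
print(swap_links(page))   # <a href="/articles/page-b">Page B</a>
```

Because the swap happens per-request, the cost is paid on every page view – which is exactly why it ends up being such a hot spot.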

Making the links discoverable generally involves parsing the HTML on save, and maintaining a graph or network of links.  This is not a complicated process, but it can be computationally expensive, so it’s often done asynchronously.  When a content item is saved, some separate thread or process comes through and picks through all the HTML fields in the content looking for hyperlinks. With each link, it has to determine if the link is to another page in the CMS, and then enter some record binding those two pages together.
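That save-time pass might look something like this sketch, again using a hypothetical `cms://` scheme to mark internal links:

```python
from html.parser import HTMLParser

# Hypothetical sketch of the save-time pass: pick through the stored
# HTML for hyperlinks, keep only the internal ones, and record an edge
# (source -> target) in a link graph.

class LinkCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.targets = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value and value.startswith("cms://"):
                    self.targets.append(int(value[len("cms://"):]))

link_graph = set()   # edges: (source_id, target_id)

def index_on_save(page_id, html):
    collector = LinkCollector()
    collector.feed(html)
    for target in collector.targets:
        link_graph.add((page_id, target))

index_on_save(101, '<p><a href="cms://356">Page B</a></p>')
print(link_graph)   # {(101, 356)}
```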

This record needs to store identifiers for the two pages, obviously, but it should also store the directionality of the link (if A links to B, then A is dependent on B, but not vice-versa), and the type of relation. Beyond your basic hyperlink, there are a few other relation types worth noting:

  • Relational property links (e.g. – the “Author” property of an article links to the author’s content record)
  • Image usage (e.g. – this particular image of a teapot is used on this particular page)
  • Embedded content usage (e.g. – a common fragment of managed text is used on 27 different pages in the CMS)

If someone tries to delete or unpublish the target, this graph can be consulted to see what other pages it might affect.
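A sketch of such a relation record and the delete-time check – the field names and relation types here are illustrative, not any particular CMS’s schema:

```python
from dataclasses import dataclass

# Hypothetical sketch of the relation record: source, target, and a
# relation type, plus a helper that asks the graph what would break
# if a target were deleted or unpublished.

@dataclass(frozen=True)
class Relation:
    source_id: int   # the dependent page (A)
    target_id: int   # the page depended upon (B)
    kind: str        # "hyperlink", "property", "image", "embed"

relations = {
    Relation(101, 356, "hyperlink"),
    Relation(102, 356, "image"),
    Relation(103, 200, "property"),
}

def inbound(target_id):
    """Everything that would be affected by deleting target_id."""
    return [r for r in relations if r.target_id == target_id]

for r in inbound(356):
    print(f"page {r.source_id} depends on 356 via {r.kind}")
```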

(Automatic content expiration can be problematic here.  We had a client who would schedule content to be expired automatically.  However, the expiring content would often be the target of inbound links, which would then break without warning.  Our solution was a nightly job that would look for pages expiring in the next 72 hours, and check the link graph to see if they were a target of any links.  If so, we sent an email warning to the webmaster.

An argument could be made that you should warn the editor of inbound links when they schedule content expiration, or warn them when they link to content that has an expiration date scheduled.  But if they do it anyway, which is worse – not expiring content when it should expire, or knowingly breaking a link?)
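The nightly job described above might be sketched like this – the dates and table shapes are invented for illustration:

```python
from datetime import datetime, timedelta

# Hypothetical sketch of the nightly job: find pages expiring within
# 72 hours, consult the link graph for inbound links, and flag the
# affected pages for a warning email to the webmaster.

link_graph = {(101, 356), (102, 400)}        # (source, target) edges
expirations = {356: datetime(2012, 2, 26)}   # target -> scheduled expiry

def pages_to_warn(now):
    horizon = now + timedelta(hours=72)
    return [
        page_id
        for page_id, expires in expirations.items()
        if now <= expires <= horizon
        and any(target == page_id for _, target in link_graph)
    ]

print(pages_to_warn(datetime(2012, 2, 25)))   # [356]
```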

eZ publish actually has a really nice system here: each content item has a “Relations” tab which shows all the relationships between this piece of content and any others in the system (regardless of directionality), along with the nature of those relationships.  You can instantly see all the different ways this content item is used by or is using other content in the repository.  (Now that I think about it, I could build the same thing for EPiServer in about 30 minutes.)

Link management can be unpleasant.  It’s one of those things that often gets overlooked or handled poorly.  I worked on a Drupal install a couple of years ago, and I couldn’t find any clean way to effectively manage links that editors inserted into content (this may have changed since then, I have no idea).

The result is this lingering, background unease, as you contemplate multiple editors inserting URL after URL into WYSIWYG editors.  All those URLs, just sitting around waiting to break and start growing content rot.  It’s almost enough to keep me up at night.

###

Comments

  1. Link management in CMS has always been a hot topic. SDL Tridion (disclaimer: I work for SDL) has been doing this for over 10 years with the concept of Dynamic Links, whereby the content reference is always stored as an ID but "resolved" at run time through the delivery API (Java or .NET), allowing implementers to decide what to do with that link.

    If the target is available:

    • Link to the correct page (or binary)

    If the target is not available:

    • Show the text but not the link (useful for links within text)
    • Hide both text and link (useful for navigation links, for instance)

    With caching turned on, link resolving has a very small impact on any page (recently I measured 3 milliseconds to "resolve" 100 links). The downside of this is that your web layer cannot be static html, it must use either Java or .NET (or any language that can talk to an OData webservice).

    N

  2. Great article, thanks.

    Very important feature for any site that’s bigger than a hundred pages. My favorite CMS ProcessWire handles this nicely at run time: http://processwire.com/talk/topic/236-module-page-link-abstractor/

    If you have built your site well, you can keep in-content links to a minimum with good use of real page relations, but of course you always need some in-content links too.

Comments are closed. If you have something you really want to say, email editors@gadgetopia.com and we'll get it added for you.