Supplemental Indexing in Content Management

By Deane Barker on February 21, 2012

I’ve suddenly become quite interested in the idea of supplemental indexing in content management.  This is primarily because I have a long history of hating content management query capabilities.

I’m mystified at systems that are less adept than a simple relational database at letting you have your content back. You put it in the system, then go to get it back out…and the API just sucks.

This still gets me, to this day.  The fact is, the query abilities of most content management systems just suck.  It’s as if they wanted us to put our content in, but never expected us to want to get it back out.  Vendors, if your CMS isn’t at least as good as the SQL databases we all started building back in the day, you really need to re-examine your priorities.

Traditionally, a CMS forces you to rely on their provided API.  The two systems we work with regularly have facilities for querying: eZ publish has their Fetch architecture, and EPiServer has what they call Find Pages With Criteria.  Both are…okay, if a little verbose.  (We worked with Ektron in the past, and their system has traditionally been completely hopeless in this regard, though they’ve done some neat work lately with their new Framework API, and I hear they’ve solved this problem.)

But there are times when even the most well-thought-out API breaks down.  With EPiServer, a client needed to do a simple parenthetical query – WHERE A=B AND ( C=D OR E=F ) – and the Find Pages With Criteria API did not handle it well (or at all, really).  In discussions with other EPiServer developers, the proposed solutions were all some form of code acrobatics for something SQL handles natively.

I’ve run into the same problem with just about every CMS I’ve ever worked with.  Querying APIs are fundamentally wrappers around SQL (inasmuch as almost every CMS still uses a SQL-based database), and they always lose something in the abstraction.  They’re built to handle 90% of the things you can come up with.  The last 10% is where things break down and where you find yourself lamenting, “But I could do this in straight SQL in about two seconds…”

And this is a real problem, because as I passionately wrote in the previously-linked post:

Retrieval APIs are foundational. They are not an add-on. They are one of the pillars of content management, period.

In most cases, your need for content retrieval is not that advanced.  But there are situations when you really want to use a CMS (or part of one) as a big relational database, and unless you’re using a natively relational framework like Rails or Django, or an ORM-based CMS like WebNodes, you’re going to test the limits of your API.

So, recently, Blend has done a few projects where we had to do some advanced querying and we just…punted.  We created companion, query-friendly indices of CMS content alongside the CMS repository, and used them for querying.  If your CMS has a decent event-based API, this isn’t hard, and I’ve found it to be a fantastic way around sticky problems.

For the first project, we used Lucene.Net.  Though Lucene is traditionally used for full-text indexing, we used it here for straight-up field-based search of a subset of the CMS content.  In this case, we denormalized the data considerably, which is a great thing about a supplemental index – you store content in whatever form makes it easiest to query, regardless of how many data modeling rules that breaks.
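To make that concrete, here’s a minimal sketch of the pattern, written against the Java Lucene API (which Lucene.Net mirrors almost method-for-method); the pageId/type/region fields are invented for the example.  Note how naturally it expresses the parenthetical WHERE A=B AND ( C=D OR E=F ) shape from above:

```java
import java.nio.file.Paths;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.store.FSDirectory;

public class SupplementalIndexDemo {
  public static void main(String[] args) throws Exception {
    FSDirectory dir = FSDirectory.open(Paths.get("supplemental-index"));

    // Index one denormalized record. Store only the CMS identifier;
    // everything else exists purely to be queried on.
    try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
      Document doc = new Document();
      doc.add(new StringField("pageId", "1234", Field.Store.YES));
      doc.add(new StringField("type", "article", Field.Store.NO));
      doc.add(new StringField("region", "midwest", Field.Store.NO));
      // updateDocument acts as an upsert keyed on pageId, so re-indexing
      // the same page replaces the old entry instead of duplicating it.
      writer.updateDocument(new Term("pageId", "1234"), doc);
    }

    // The parenthetical query: type = article AND (region = midwest OR region = national)
    BooleanQuery regions = new BooleanQuery.Builder()
        .add(new TermQuery(new Term("region", "midwest")), BooleanClause.Occur.SHOULD)
        .add(new TermQuery(new Term("region", "national")), BooleanClause.Occur.SHOULD)
        .build();
    BooleanQuery query = new BooleanQuery.Builder()
        .add(new TermQuery(new Term("type", "article")), BooleanClause.Occur.MUST)
        .add(regions, BooleanClause.Occur.MUST)
        .build();

    try (DirectoryReader reader = DirectoryReader.open(dir)) {
      IndexSearcher searcher = new IndexSearcher(reader);
      for (ScoreDoc hit : searcher.search(query, 20).scoreDocs) {
        // Only the identifier comes back out; the content itself stays in the CMS.
        System.out.println(searcher.doc(hit.doc).get("pageId"));
      }
    }
  }
}
```

The upsert-by-term pattern matters more than it looks: publish events fire more than once in real life, so re-indexing has to be idempotent.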

In the second case, we went even more basic – we wrote a plugin for EPiServer which converted portions of the page repository to a set of flat SQL tables.  So we essentially write out a companion database, next to the CMS database, which we use for advanced querying.  This is for a client converting existing SQL queries, and it’s worked beautifully for re-using their existing query logic.
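A minimal sketch of what the write side of a plugin like that might look like, assuming a SQL Server backing store (EPiServer’s usual home); the PageIndex table and its columns are hypothetical:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;

public class FlatTableSync {

  // One row per page, one column per queryable property. MERGE gives us
  // an upsert, so publish events can be replayed safely.
  private static final String UPSERT =
      "MERGE INTO PageIndex AS t "
    + "USING (VALUES (?, ?, ?, ?)) AS s (PageId, Title, Region, Published) "
    + "ON t.PageId = s.PageId "
    + "WHEN MATCHED THEN UPDATE SET Title = s.Title, Region = s.Region, Published = s.Published "
    + "WHEN NOT MATCHED THEN INSERT (PageId, Title, Region, Published) "
    + "  VALUES (s.PageId, s.Title, s.Region, s.Published);";

  // Called from the CMS's published/moved/deleted events.
  public static void upsertPage(Connection conn, int pageId, String title,
                                String region, boolean published) throws Exception {
    try (PreparedStatement ps = conn.prepareStatement(UPSERT)) {
      ps.setInt(1, pageId);
      ps.setString(2, title);
      ps.setString(3, region);
      ps.setBoolean(4, published);
      ps.executeUpdate();
    }
  }
}
```

Once a table like that exists, the client’s old SQL queries run against it essentially unmodified, which was the whole point.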

For the third project, we hooked up Solr and used it to store a separate index of all the content in a truly massive eZ publish installation.  We’ve got this dialed in to the point where we could effectively use this separate index for any query against the repository – you can fundamentally drive the entire site out of Solr, if you want.
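Here’s the flavor of it, as a sketch using the SolrJ client; the Solr URL, core name, and field names are invented for the example:

```java
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

public class SolrQueryDemo {
  public static void main(String[] args) throws Exception {
    try (HttpSolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/cms").build()) {
      // The same parenthetical logic the CMS API choked on, as a Solr query.
      SolrQuery q = new SolrQuery("type:article AND (region:midwest OR region:national)");
      q.setRows(20);
      q.setSort("published_at", SolrQuery.ORDER.desc);

      QueryResponse response = solr.query(q);
      for (SolrDocument doc : response.getResults()) {
        // As always, we take only identifiers back out of the index.
        System.out.println(doc.getFieldValue("pageId"));
      }
    }
  }
}
```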

These have been great solutions, but using a supplemental index hinges on a few things:

First, your CMS has to be such that you can keep the index updated.  This means you need an event-capture API so you can be notified of content changes and update the changed content in the index.  Additionally, you need to make sure you capture cascading updates – changing page A may change page B, so how do you capture that?  In a tree structure, changing the parent may change every page below it in the tree, so you need to make sure you can correctly re-index content even when that content might not have changed directly.
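In pseudo-API terms, the shape of it is something like this – Cms, Page, and Index are hypothetical stand-ins for whatever event and repository APIs your CMS actually exposes:

```java
import java.util.List;
import java.util.function.Consumer;

// Hypothetical stand-ins for the CMS event API and the supplemental index.
interface Page { String id(); }
interface Cms {
  void onPublished(Consumer<Page> handler);
  void onMoved(Consumer<Page> handler);
  void onDeleted(Consumer<Page> handler);
  List<Page> childrenOf(String pageId);
}
interface Index {
  void upsert(Page page);
  void delete(String pageId);
}

class ReindexHandler {
  private final Cms cms;
  private final Index index;

  ReindexHandler(Cms cms, Index index) {
    this.cms = cms;
    this.index = index;
    // Subscribe to every event that can change queryable state.
    cms.onPublished(this::reindexSubtree);
    cms.onMoved(this::reindexSubtree);   // a move changes descendants' paths too
    cms.onDeleted(page -> index.delete(page.id()));
  }

  // Changing a parent can change every page below it (URLs, inherited
  // fields), so walk the whole subtree rather than trusting the one event.
  void reindexSubtree(Page page) {
    index.upsert(page);
    for (Page child : cms.childrenOf(page.id())) {
      reindexSubtree(child);
    }
  }
}
```

The recursion is the important part – the naive version that indexes only the page named in the event is exactly how indexes drift out of sync.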

Second, your CMS needs to have a decent filtering API, meaning you can toss a bucket of content at it and say, “Just give me back the content this user has access to,” or “Just give me back the content that’s public and published.”  Your supplemental index probably knows nothing about your permissions model, so you need to make sure your CMS can pick through a raw pile of content and take out what the current user shouldn’t see.
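The handoff is simple in shape, if the CMS cooperates.  A sketch, where canRead stands in for whatever permission check your CMS exposes:

```java
import java.util.List;
import java.util.function.BiPredicate;
import java.util.stream.Collectors;

class AccessFilter {
  // rawHits comes straight from the supplemental index, which knows nothing
  // about permissions; the CMS gets the final word on visibility.
  static List<String> visibleTo(BiPredicate<String, String> canRead,
                                String user, List<String> rawHits) {
    return rawHits.stream()
                  .filter(pageId -> canRead.test(user, pageId))
                  .collect(Collectors.toList());
  }
}
```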

Third, you need to faithfully treat your supplemental index as disposable.  It’s truly supplemental – it’s a querying facility, and nothing more.  You don’t actually read content out of it; rather, you just read some page identifier, which you then use to pull the full content from the core CMS API.  You should be able to wipe out and re-create this index without concern.  A sure way to run into problems is to start using your index as more than just a querying tool.
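Which means a full rebuild should always be boring.  Sketched with the same kind of hypothetical types as above:

```java
import java.util.List;

// Hypothetical types, kept tiny; the point is the shape of a full rebuild.
interface Repository { List<String> allPageIds(); Object load(String pageId); }
interface DisposableIndex { void wipe(); void upsert(String pageId, Object page); }

class RebuildJob {
  // Nothing but query data lives in the index, so wiping it and re-walking
  // the repository is always safe (if not always fast).
  static void run(Repository cms, DisposableIndex index) {
    index.wipe();
    for (String pageId : cms.allPageIds()) {
      index.upsert(pageId, cms.load(pageId));
    }
  }
}
```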

My initial reaction to the idea of a supplemental index was that it was somehow less “pure” than a real-time query of the actual CMS repository, but my experiences over the last year have completely changed my mind.  Writing your own index can alleviate so many problems that it frees you up to spend more time on the more challenging aspects of your integration.

Trying to force a query API to do what even something as simple as SQL has done for years shouldn’t be the hardest thing about a CMS.  Sadly, it often is.

Comments

  1. While I see the benefits of a supplemental index in some situations, I also see a lot of potential drawbacks:

    1.) Maintainability – how well will the supplemental index work with new versions of the CMS? How easy will it be for new developers to understand the custom solution, 2 years from now?

    2.) Layered complexity – A supplemental index might seem like a nice and easy solution at first sight, but in my experience you often get into fairly complex problems for basic functionality very quickly if you treat it as a general query API.

    For example, let’s say you want to fetch the first 20 articles that the current user has access to. Since you want to use the CMS for access control and just send in the ids of the articles you want the CMS to fetch, what do you do? In an extreme case, the current user could have access to just one article among the 20 latest articles. How do you make sure that the CMS returns 20 articles for the current user? Do you send in 50 article ids to the CMS? A hundred? All article ids? How do you solve paging?

    Issues such as the one above can be solved, but IMO you often end up with a really messy solution.

    3.) Extra work – Creating a good supplemental index takes quite a few hours. Hours that could have been spent creating value for the customer. How competitive will your bids for projects be if you have to set up a supplemental index? How many hours will be spent training your developers on the custom solution? How do you intend to charge for maintenance in the future?

    Overall, I’m not opposed to supplemental indexes for specific pieces of functionality, but I don’t think they are a good solution as a general purpose query API.

  2. Vidar:

    In general, I agree. The fact that we even have to solve this at all sucks, but solve it we must. To your points:

    how well will the supplemental index work with new versions of the CMS?

    The same could really be said of any custom code written for an integration. We have custom code all over the place, so this is nothing new. Sure, less is better, but this problem will always be around.

    A supplemental index might seem like a nice and easy solution at first sight, but in my experience you often get into fairly complex problems for basic functionality very quickly

    I have no doubt this is true on the fringes, but the basic advantages are clear. As for your specific issue – yeah, it’s a pain. But I could see something like Lucene’s PriorityQueue working just fine (that’s how Lucene does it, anyway).
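    Roughly, the over-fetch approach looks like this – a sketch, where fetchBatch and canRead are hypothetical hooks into the index and the CMS permission check:

    ```java
    import java.util.ArrayList;
    import java.util.List;
    import java.util.function.IntFunction;
    import java.util.function.Predicate;

    class PermissionAwarePager {
      // Keep pulling batches of ids from the index and filtering them through
      // the CMS permission check until the page is full or the index runs dry.
      static List<String> firstPage(IntFunction<List<String>> fetchBatch,
                                    Predicate<String> canRead, int pageSize) {
        List<String> results = new ArrayList<>();
        int offset = 0;
        while (results.size() < pageSize) {
          List<String> batch = fetchBatch.apply(offset);
          if (batch.isEmpty()) break;              // index exhausted
          for (String id : batch) {
            if (results.size() == pageSize) break;
            if (canRead.test(id)) results.add(id);
          }
          offset += batch.size();
        }
        return results;
      }
    }
    ```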

    Creating a good supplemental index takes quite a few hours. Hours that could have been spent creating value for the customer

    Again, the same could be said of anything. And, again, I agree that if we didn’t have to do this, we’d be much better off.

    A key point is that I’m not saying “This is the way it should be!” Instead, I’m saying, “This is the best solution for the unavoidable problem of crappy retrieval APIs!” Solve the first-order problem (the crappy API) and I won’t have to solve the second-order problem (having a supplemental index).

  3. If you have to find a way to improve the query capabilities of your chosen CMS, I agree that a supplemental index is perhaps the best solution, even considering the drawbacks.

    But the underlying issue here as I see it is that the role of a CMS is changing, from just publishing pages to powering more complex web applications. If this trend continues, query capabilities will become even more important in the future.

Comments are closed. If you have something you really want to say, email editors@gadgetopia.com and we’ll get it added for you.