Middle Ground: Content Management using Static HTML

By Deane Barker on November 23, 2005

I’ve been toying with an idea lately, and instead of actually doing it (don’t have the time), I’m going to throw it out here for fun. My idea is for an extremely simplistic content management system – one based on HTML files and a scheduled file system crawl.

First, some things I believe:

  • Creating content on the Web is a huge drag. Yes, we can do great things with HTML forms and Ajax and WYSIWYG editors and all that, but in the end, you’re still composing content in a Web page. Even the best environment can’t compare to the functionality in FrontPage, Dreamweaver, or Nvu (if they fixed a few flaws).

  • Content is generally more strictly managed in most systems than it really needs to be. The fact is, most content we deal with is a title, a big chunk of text, and perhaps a few additional properties.

A movie review, for instance, is a title, a big chunk of text, and perhaps a movie poster and the number of stars. A product listing is a title, a big chunk of text, and maybe a handful of specifications.

There are always exceptions, of course, but the fact remains that you don’t see a lot of really complicated, relational databases jammed into content management systems. Most systems manage pretty loose content.

At the same time that I’ve come to believe what I’ve written above, I’ve had some experience managing some larger static sites. When you manage a static site of 100 pages or more, you quickly run into two big problems:

  1. Enforcing consistency between pages

  2. Managing menus and index pages

Here’s how I handle the first one, and here’s a theory I’d like to try on the second.

Enforcing Consistency

I have a PHP prepend and append file, so every hit to a PHP page gets “bookended” by these two files. The prepend file starts buffering, and the first thing the append file does is read the buffer into a variable. Nothing too out of the ordinary there.

But then the append file compares the URL to a series of regular expressions, stopping on the first one that matches. Based on the match, the append file inserts HTML just under the open BODY tag, and just above the close BODY tag. It also inserts a suffix to the TITLE tag, and a stylesheet link just under the TITLE tag. (And I’m modifying it this weekend to also insert a submenu, when specified.)
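The append-file logic could be sketched roughly like this (in Python rather than the PHP the site actually runs; the rule patterns, markup snippets, and function name here are all made up for illustration):

```python
import re

# Hypothetical layout rules: the first pattern that matches the URL wins.
LAYOUTS = [
    (re.compile(r"^/reviews/"), {"title_suffix": " | Movie Reviews",
                                 "stylesheet": "/css/reviews.css",
                                 "header": "<div id='reviews-header'>Reviews</div>",
                                 "footer": "<div id='reviews-footer'>Back to top</div>"}),
    (re.compile(r"^/"),         {"title_suffix": " | My Site",
                                 "stylesheet": "/css/main.css",
                                 "header": "<div id='header'>My Site</div>",
                                 "footer": "<div id='footer'>&copy; 2005</div>"}),
]

def decorate(url, html):
    """Insert the header, footer, stylesheet link, and TITLE suffix
    dictated by the first URL pattern that matches."""
    layout = next(rules for pattern, rules in LAYOUTS if pattern.search(url))
    # Suffix the TITLE, then drop a stylesheet link just under it.
    html = html.replace("</title>",
                        layout["title_suffix"] + "</title>\n"
                        '<link rel="stylesheet" href="%s">' % layout["stylesheet"], 1)
    # Header just under the open BODY tag, footer just above the close.
    html = re.sub(r"(<body[^>]*>)", r"\1" + layout["header"], html, count=1)
    return html.replace("</body>", layout["footer"] + "</body>", 1)
```

Move a page from one folder to another and the first matching pattern changes, so the whole look of the page changes with it.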

What this means is that when I create a static HTML page for a site under this system, I don’t have to worry about header or footer includes, TITLE tag format, stylesheets – anything like that. I just compose in simple, unformatted HTML, and when the page is requested, it gets processed. Put another way, all the formatting of the page, from headers to footers to styles, is centrally controlled. The page author has no choice.

It also means you can vary things greatly by section. The main part of your site can look a certain way, and a subsection (designated by URL pattern), can look completely different. Put a page in Folder A and it looks one way, but move it to Folder B and it looks completely different.

This system has worked extremely well for me, and has enabled me to keep a hundred or so static HTML pages totally consistent with each other. What it ultimately means is that the page on the file system stays “pure.” There’s no need for PHP code, file includes, stylesheet references, etc. All that’s in the file is the actual content that’s supposed to be there.

Managing Menus and Index Pages

How do I maintain an index page of news articles without hand-coding it and updating it whenever I add or remove a page? How do I keep track of what goes on the front page of the site? And if I delete an HTML file, how do I know all the index pages in which it appears so I can remove reference to it?

The bottom line is that even when you have your static files managed as perfectly as possible, you still have problems relating all this content and keeping it organized and accessible. So how do you cross that chasm without going to full-blown content management?

Here’s an idea:

Create a scheduled process that crawls your HTML files and converts them to database records. Then use these records to power your index pages and other dynamic sections of your site.

An example:

Say I have a folder full of movie reviews. Each one is a static HTML page. I want to have an index page listing all the reviews. This is actually pretty simple – I just have a scheduled process that crawls the folder, extracts the TITLE tag from each file, and logs it with the filename in a database table. Run that process once an hour, and then pull from the database table to run your index page.
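That scheduled crawl might look something like this (a minimal sketch, assuming SQLite and a flat folder of .html files; the table and function names are my own inventions):

```python
import os
import re
import sqlite3

TITLE_RE = re.compile(r"<title>(.*?)</title>", re.IGNORECASE | re.DOTALL)

def index_folder(folder, db):
    """Crawl a folder of HTML files, extract each TITLE tag,
    and log it with the filename in a database table."""
    db.execute("CREATE TABLE IF NOT EXISTS pages "
               "(filename TEXT PRIMARY KEY, title TEXT)")
    for name in sorted(os.listdir(folder)):
        if not name.endswith(".html"):
            continue
        with open(os.path.join(folder, name), encoding="utf-8") as f:
            match = TITLE_RE.search(f.read())
        if match:
            # Re-running the crawl just refreshes the existing record.
            db.execute("INSERT OR REPLACE INTO pages VALUES (?, ?)",
                       (name, match.group(1).strip()))
    db.commit()
```

Run that from cron once an hour, and the index page is just `SELECT filename, title FROM pages ORDER BY title`.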

But what if I wanted to have the star ratings and a one-line summary of the review on the index page too? Where do I put that stuff? In the page META. Have a META tag for “description” and another for “star_rating.” Then, when your process crawls the folder, log those in separate database fields.

(Yes, yes, there are potential datatype issues here. But your users just need to be careful and be notified when there’s been a problem. If the crawler finds anything other than an integer in the “star_rating” META tag, it skips it and logs an error.)
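The META extraction, with that “skip and log” validation, could be sketched like so (the META attribute format and error handling here are assumptions, not a spec):

```python
import re

META_RE = re.compile(r'<meta\s+name="([^"]+)"\s+content="([^"]*)"',
                     re.IGNORECASE)

def extract_meta(html, errors):
    """Pull name/content pairs from META tags. If star_rating isn't
    an integer, skip it and log an error instead of failing."""
    meta = dict(META_RE.findall(html))
    if "star_rating" in meta:
        if meta["star_rating"].isdigit():
            meta["star_rating"] = int(meta["star_rating"])
        else:
            errors.append("star_rating is not an integer: %r"
                          % meta.pop("star_rating"))
    return meta
```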

Some other thoughts on this:

  • You can create META for just about anything you need. You could have a META field for “front_page” and set it to “1” for any movie review that should be on the front page. When that review should come off the front page, change the META tag and re-index.

There’s no need to change an admin interface when you do this (the HTML file is the admin interface). And if you put your META fields in a table in key-value format, you don’t even have to change your data model when you start or stop using a certain META tag. The indexer would just log everything it found without question.
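A key-value meta table along those lines might look like this (a sketch using SQLite; the schema is an assumption):

```python
import sqlite3

def log_meta(db, filename, meta):
    """Store every META pair in key-value form, so starting or
    stopping use of a tag never requires a schema change."""
    db.execute("""CREATE TABLE IF NOT EXISTS page_meta
                  (filename TEXT, key TEXT, value TEXT,
                   PRIMARY KEY (filename, key))""")
    # Wipe and re-log on each crawl so removed tags disappear too.
    db.execute("DELETE FROM page_meta WHERE filename = ?", (filename,))
    db.executemany("INSERT INTO page_meta VALUES (?, ?, ?)",
                   [(filename, k, str(v)) for k, v in meta.items()])
    db.commit()
```

The front page then becomes a query: `SELECT filename FROM page_meta WHERE key = 'front_page' AND value = '1'`.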

  • Foreign keys could be loosely emulated by using the filename of the foreign record as a key. For instance, the “author” META tag could be “dbarker.” There would be a file named “dbarker.html” in the “authors” folder which is being managed by the same process. All it takes is a simple join to bring the two tables together.
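That loose join could look like this (again a sketch; the table names and the “append .html to the META value” convention are assumptions):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE reviews (filename TEXT, title TEXT, author TEXT);
CREATE TABLE authors (filename TEXT, name TEXT);
INSERT INTO reviews VALUES ('jaws.html', 'Jaws', 'dbarker');
INSERT INTO authors VALUES ('dbarker.html', 'Deane Barker');
""")

# The "author" META value plus ".html" acts as a loose foreign key
# into the authors folder, which the same crawl process manages.
row = db.execute("""SELECT r.title, a.name
                    FROM reviews r
                    JOIN authors a ON a.filename = r.author || '.html'
                 """).fetchone()
```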

  • Full-text search would be even simpler yet. When your process crawls the files, just have it log the entire text of the page (sans HTML tags) in another field. (Please tell me you saw that coming.)
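Stripping the tags before logging can be as crude as this (a real indexer would parse properly; this regex sketch is just the idea):

```python
import re

def strip_tags(html):
    """Replace tags with spaces, then collapse whitespace, so the
    bare text can be logged into a full-text field."""
    text = re.sub(r"<[^>]+>", " ", html)
    return re.sub(r"\s+", " ", text).strip()
```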

  • What if you delete an HTML file? Until the process runs again, you have a record in the database without a file.

To handle this, use the 404 as an alert. When a page is not found, have the 404 page look for the database record corresponding to the missing file and disable it so that page reference instantly comes out of all index pages.
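Called from the 404 page, that cleanup is a one-line update (sketched against the hypothetical `pages` table, with an assumed `active` flag column):

```python
import sqlite3

def handle_404(db, missing_filename):
    """Flag the orphaned database record so every index page
    stops listing it the moment the 404 fires."""
    cur = db.execute("UPDATE pages SET active = 0 WHERE filename = ?",
                     (missing_filename,))
    db.commit()
    return cur.rowcount  # how many records were disabled
```

Index-page queries would then filter on `WHERE active = 1`.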

  • Ideally, instead of a scheduled process, find one of those little monitoring apps that will watch a folder and run a batch file when it detects a file change. Use this to instantly re-index single files when they change.

  • You could have a file or table that specified META that should be removed before display, if you didn’t want all META to be sent to the browser. If you want some META there for the indexer, but you don’t want the user to see it, the append file mentioned earlier could suck that out before flushing the buffer.

So that’s the gist of it. The HTML file itself really becomes the database editing interface – it’s the bridge between the user and the database. The user can manage the file however he or she feels like it. At a certain interval (or on demand), the files get converted into database records which are awfully easy to query, manipulate, and display.

(Note that anything I said here about database tables goes the same for search engine indexers. A search engine like Swish-E could do pretty much everything I’ve described and it’s monstrously fast. Running on a 2.4GHz P4, Swish-E indexes all 4,600 HTML files on this site in eight seconds of CPU time. See this post.)

I envision a simple Web interface where the site admin can log in, then:

  1. traverse the HTML folder structure and view files

  2. re-index individual files or entire folders on-demand

  3. kick off a full-scale index of the entire site

  4. browse the logged meta

  5. run test SQL

  6. see the results – including error reports – of previous crawls

  7. specify headers, footers, stylesheets, and submenus for various URL patterns

Of course, this system only works if the users are managing their files via an HTML editor. But I think a lot of users could, and certainly most Web developers. I think there’s a fair number of situations where it could work very well.

And yes, this is simplistic. But it really bridges the gap between a big stack of HTML files and full-blown content management. Call it middle ground.

Comments (5)

Darren Chamberlain says:

I’ve been thinking about building a system sort of like this, except (and this is what got me thinking about it), I would use subversion properties for all the metadata. The crawler would need to be smart about extracting those properties – for each file it found, it would need to get a property list (svn proplist $FILE) and then extract the value for each property (%props = map { svn propget $_ $FILE } @props or something similar). Then, of course, everything is in svn, including the metadata.

I’d be interested to find out if anyone has implemented this system, or something like it. I’ve specced out this system, but not actually started building it. The main reason I want to move to something like this (in my case, away from Movable Type) is because I want everything under revision control.

Scott S. McCoy says:

This is a terribly archaic and flawed system. There are much better ways to address content management. Data within an environment as described above is completely useless outside of that environment.

A far better solution would be to mark up your data coherently, so at least then you could transform your data to a document type that had logical markup. HTML is hardly a storage solution.

This solution also depends on some very overly-heavy tools (PHP? Ghad) and does not address some of the issues that CMSs have been fit to address, such as revision control, inter-document dependency, cross referencing and organization.

A couple of PHP scripts (that should be replaced with either server side includes, XML Include, document transclusion, or some other properly fit solution) aren’t any way of “enforcing” consistency or creating coherent storage for your content.

Ramita says:

The question is of the efficiency of such a system.

T. Clark says:

I haven’t read your entire post, no time, but curious if you have ever tried CMSimple? Not a bad solution and close to what you are describing. VERY small footprint and very easy to learn, setup, etc. I’ve used it for a few sites (e.g. mychristmasgreeting.com).


Jake says:

This post is 3 years old but your idea seems to be more popular now than ever with static site generators like Jekyll, Nanoc, and Webby.

Would love to hear your opinions on these!