I’ve been toying with an idea lately, and instead of actually doing it (don’t have the time), I’m going to throw it out here for fun. My idea is for an extremely simplistic content management system – one based on HTML files and a scheduled file system crawl.
First, some things I believe:
Creating content on the Web is a huge drag. Yes, we can do great things with HTML forms and Ajax and WYSIWYG editors and all that, but in the end, you’re still composing content in a Web page. Even the best Web-based environment can’t compare to the functionality in FrontPage, Dreamweaver, or Nvu (if they fixed a few flaws).
Content is generally more strictly managed in most systems than it really needs to be. The fact is, most content we deal with is a title, a big chunk of text, and perhaps a few additional properties.
A movie review, for instance, is a title, a big chunk of text, and perhaps a movie poster and the number of stars. A product listing is a title, a big chunk of text, and maybe a handful of specifications.
There are always exceptions, of course, but the fact remains that you don’t see a lot of really complicated, relational databases jammed into content management systems. Most systems manage pretty loose content.
At the same time that I’ve come to believe what I’ve written above, I’ve had some experience managing some larger static sites. When you manage a static site of 100 pages or more, you quickly run into two big problems:
Enforcing consistency between pages
Managing menus and index pages
Here’s how I handle the first one, and here’s a theory I’d like to try on the second.
I have a PHP prepend and append file, so every hit to a PHP page gets “bookended” by these two files. The prepend file starts buffering, and the first thing the append file does is read the buffer into a variable. Nothing too out of the ordinary there.
But then the append file compares the requested URL to a series of regular expressions, stopping at the first one that matches. Based on the match, the append file inserts HTML just under the opening BODY tag and just above the closing BODY tag. It also appends a suffix to the TITLE tag and inserts a stylesheet link just under it. (And I’m modifying it this weekend to also insert a submenu, when specified.)
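My real version lives in a PHP append file, but the logic is easy to sketch in Python. Treat this as a rough illustration only: the URL patterns, the header and footer markup, and the names SECTION_RULES and decorate are all made up for the example.

```python
import re

# Hypothetical per-section rules; the first pattern that matches the URL wins.
SECTION_RULES = [
    (re.compile(r"^/reviews/"), {
        "header": '<div id="header">Movie Reviews</div>',
        "footer": '<div id="footer">Reviews footer</div>',
        "title_suffix": " | Reviews",
        "stylesheet": "/css/reviews.css",
    }),
    (re.compile(r"^/news/"), {
        "header": '<div id="header">News</div>',
        "footer": '<div id="footer">News footer</div>',
        "title_suffix": " | News",
        "stylesheet": "/css/news.css",
    }),
]


def decorate(url_path, html):
    """Wrap a plain HTML page in the chrome for the section its URL falls in."""
    rule = next((r for p, r in SECTION_RULES if p.search(url_path)), None)
    if rule is None:
        return html  # no section matched; send the page through untouched

    link = '<link rel="stylesheet" href="%s">' % rule["stylesheet"]
    # Title suffix goes just before </title>, stylesheet link just after it.
    html = re.sub(r"</title>",
                  lambda m: rule["title_suffix"] + m.group(0) + "\n" + link,
                  html, count=1, flags=re.IGNORECASE)
    # Header goes just under the opening BODY tag (which may carry attributes).
    html = re.sub(r"<body[^>]*>",
                  lambda m: m.group(0) + "\n" + rule["header"],
                  html, count=1, flags=re.IGNORECASE)
    # Footer goes just above the closing BODY tag.
    html = re.sub(r"</body>",
                  lambda m: rule["footer"] + "\n" + m.group(0),
                  html, count=1, flags=re.IGNORECASE)
    return html
```

The replacements are written as functions so that backslashes in the configured markup are never interpreted as regex escapes.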
What this means is that when I create a static HTML page for a site under this system, I don’t have to worry about header or footer includes, TITLE tag format, stylesheets – anything like that. I just compose in simple, unformatted HTML, and when the page is requested, it gets processed. Put another way, all the formatting of the page, from headers to footers to styles, is centrally controlled. The page author has no choice.
It also means you can vary things greatly by section. The main part of your site can look a certain way, and a subsection (designated by URL pattern), can look completely different. Put a page in Folder A and it looks one way, but move it to Folder B and it looks completely different.
This system has worked extremely well for me, and has enabled me to keep a hundred or so static HTML pages totally consistent with each other. What it ultimately means is that the page on the file system stays “pure.” There’s no need for PHP code, file includes, stylesheet references, etc. All that’s in the file is the actual content that’s supposed to be there.
Managing Menus and Index Pages
How do I maintain an index page of news articles without hand-coding it and updating it whenever I add or remove a page? How do I keep track of what goes on the front page of the site? And if I delete an HTML file, how do I know all the index pages in which it appears so I can remove reference to it?
The bottom line is that even when you have your static files managed as perfectly as possible, you still have problems relating all this content and keeping it organized and accessible. So how do you cross that chasm without going to full-blown content management?
Here’s an idea:
Create a scheduled process that crawls your HTML files and converts them to database records. Then use these records to power your index pages and other dynamic sections of your site.
Say I have a folder full of movie reviews. Each one is a static HTML page. I want to have an index page listing all the reviews. This is actually pretty simple – I just have a scheduled process that crawls the folder, extracts the TITLE tag from each file, and logs it with the filename in a database table. Run that process once an hour, and then pull from the database table to run your index page.
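Here is roughly what that hourly process could look like, sketched in Python with SQLite standing in for whatever database you use. The reviews table, its columns, and the function names are my own inventions for the example.

```python
import sqlite3
from html.parser import HTMLParser
from pathlib import Path


class TitleParser(HTMLParser):
    """Collects the text found inside the TITLE tag."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data


def open_index(db_path):
    """Open the index database, creating the (hypothetical) reviews table."""
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS reviews "
                 "(filename TEXT PRIMARY KEY, title TEXT)")
    return conn


def log_page(conn, filename, html):
    """Insert or refresh the record for one HTML file."""
    parser = TitleParser()
    parser.feed(html)
    conn.execute("INSERT OR REPLACE INTO reviews (filename, title) VALUES (?, ?)",
                 (filename, parser.title.strip()))


def index_folder(conn, folder):
    """Crawl one folder of static pages; meant to be run on a schedule."""
    for path in sorted(Path(folder).glob("*.html")):
        log_page(conn, path.name, path.read_text(encoding="utf-8", errors="replace"))
    conn.commit()
```

The index page then becomes a single query along the lines of SELECT filename, title FROM reviews ORDER BY title.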
But what if I wanted to have the star ratings and a one-line summary of the review on the index page too? Where do I put that stuff? In the page META. Have a META tag for “description” and another for “star_rating.” Then, when your process crawls the folder, log those in separate database fields.
(Yes, yes, there are potential datatype issues here. But your users just need to be careful and be notified when there’s been a problem. If the crawler finds anything other than an integer in the “star_rating” META tag, it skips it and logs an error.)
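A sketch of that META handling, again in Python. I'm assuming here that "skips it" means the bad value is dropped and the rest of the page is still indexed; the function name and the error list are placeholders for whatever logging you prefer.

```python
from html.parser import HTMLParser


class MetaParser(HTMLParser):
    """Collects <meta name="..." content="..."> pairs into a dict."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            attr = dict(attrs)
            if attr.get("name") and attr.get("content") is not None:
                self.meta[attr["name"].lower()] = attr["content"]


def extract_review_fields(filename, html, errors):
    """Return the typed fields for one review, appending problems to `errors`."""
    parser = MetaParser()
    parser.feed(html)

    stars = None
    raw = parser.meta.get("star_rating")
    if raw is not None:
        try:
            stars = int(raw)
        except ValueError:
            # Not an integer: skip the value and leave a note for the author.
            errors.append("%s: star_rating %r is not an integer" % (filename, raw))

    return {"description": parser.meta.get("description"), "star_rating": stars}
```

Collecting errors in a list rather than raising means one sloppy page never stops the rest of the crawl.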
Some other thoughts on this.
You can create META for about anything you need. You could have a META field for “front_page” and set it to “1” for any movie review that should be on the front page. When that review should come off the front page, change the META tag and re-index.
There’s no need to change an admin interface when you do this (the HTML file is the admin interface). And if you put your META fields in a table in key-value format, you don’t even have to change your data model when you start or stop using a certain META tag. The indexer would just log everything it found without question.
Foreign keys could be loosely emulated by using the filename of the foreign record as a key. For instance, the “author” META tag could be “dbarker.” There would be a file named “dbarker.html” in the “authors” folder which is being managed by the same process. All it takes is a simple join to bring the two tables together.
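To make the key-value table and the filename-as-foreign-key idea concrete, here is a small SQLite sketch driven from Python. The table layout, the column names, and the sample titles are assumptions of mine, not a worked-out schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE pages (
    folder   TEXT,
    filename TEXT,
    title    TEXT,
    PRIMARY KEY (folder, filename)
);
-- One row per META tag found, so a new tag never needs a schema change.
CREATE TABLE page_meta (
    folder   TEXT,
    filename TEXT,
    name     TEXT,
    value    TEXT
);
""")

conn.executemany("INSERT INTO pages VALUES (?, ?, ?)", [
    ("reviews", "jaws.html", "Jaws"),
    ("authors", "dbarker.html", "About dbarker"),
])
conn.executemany("INSERT INTO page_meta VALUES (?, ?, ?, ?)", [
    ("reviews", "jaws.html", "author", "dbarker"),
    ("reviews", "jaws.html", "front_page", "1"),
])

# The "author" META value plus ".html" is the filename of the author's page.
rows = conn.execute("""
    SELECT r.title, a.title
    FROM pages AS r
    JOIN page_meta AS m
      ON m.folder = r.folder AND m.filename = r.filename AND m.name = 'author'
    JOIN pages AS a
      ON a.folder = 'authors' AND a.filename = m.value || '.html'
    WHERE r.folder = 'reviews'
""").fetchall()
```

Nothing enforces the relationship, of course. If someone types an author that has no file, the inner join simply drops that review from the result, so a LEFT JOIN may be the safer choice for index pages.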
Full-text search would be even simpler yet. When your process crawls the files, just have it log the entire text of the page (sans HTML tags) in another field. (Please tell me you saw that coming.)
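Stripping the tags is only a few lines with Python's built-in HTML parser. This is a minimal sketch; the decision to ignore SCRIPT and STYLE contents is my own assumption about what you would want searchable.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects page text, ignoring tags and the contents of script/style."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(data)


def page_text(html):
    """Return the page's text with tags removed and whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Joining with spaces keeps words in adjacent blocks apart, at the cost
    # of splitting a word that has inline markup in the middle of it.
    return " ".join(" ".join(parser.parts).split())
```

The returned string can go straight into a text column next to the title.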
What if you delete an HTML file? Until the process runs again, you have a record in the database without a file.
To handle this, use the 404 as an alert. When a page is not found, have the 404 page look for the database record corresponding to the missing file and disable it so that page reference instantly comes out of all index pages.
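The 404 hook could be as small as this. It assumes a pages table with an enabled flag that every index query filters on; the table, the flag, and the function name are all hypothetical.

```python
import sqlite3


def disable_missing_page(conn, url_path):
    """Call from the 404 handler: switch off the record for a vanished file.

    Returns True if a record was found and disabled.
    """
    folder, _, filename = url_path.strip("/").rpartition("/")
    cur = conn.execute(
        "UPDATE pages SET enabled = 0 WHERE folder = ? AND filename = ?",
        (folder, filename))
    conn.commit()
    return cur.rowcount > 0
```

Index pages would then select only rows WHERE enabled = 1, so the dead link disappears on the very next request instead of waiting for the next crawl.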
Ideally, instead of a scheduled process, find one of those little monitoring apps that will watch a folder and run a batch file when it detects a file change. Use this to instantly re-index single files when they change.
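If no such monitoring app is handy, a poor man's version is to poll modification times. This sketch is a stand-in for a real file watcher, not a replacement for one, and the function name is made up.

```python
from pathlib import Path


def changed_files(folder, seen):
    """Return HTML files that are new or modified since the last call.

    `seen` maps each path to the modification time recorded on the previous
    pass and is updated in place. Deleted files are not reported here.
    """
    changed = []
    for path in sorted(Path(folder).rglob("*.html")):
        mtime = path.stat().st_mtime_ns
        if seen.get(path) != mtime:
            seen[path] = mtime
            changed.append(path)
    return changed
```

Call it in a loop every few seconds and hand each returned path to the single-file indexer.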
You could have a file or table that specified META that should be removed before display, if you didn’t want all META to be sent to the browser. If you want some META there for the indexer, but you don’t want the user to see it, the append file mentioned earlier could suck that out before flushing the buffer.
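Sucking the indexer-only META out of the buffer might look like the following. It is a regex-based sketch that assumes simple, conventionally written META tags (no ">" inside attribute values), and the HIDDEN_META list is just an example.

```python
import re

# Hypothetical list of META names meant for the indexer, not the browser.
HIDDEN_META = {"star_rating", "front_page", "author"}

META_TAG = re.compile(r"<meta\b[^>]*>\s*", re.IGNORECASE)
NAME_ATTR = re.compile(r"""\bname\s*=\s*["']?([\w.-]+)""", re.IGNORECASE)


def strip_hidden_meta(html):
    """Remove META tags whose name is in HIDDEN_META; leave the rest alone."""
    def replace(match):
        name = NAME_ATTR.search(match.group(0))
        if name and name.group(1).lower() in HIDDEN_META:
            return ""
        return match.group(0)

    return META_TAG.sub(replace, html)
```

Anything not on the list, such as "description", still reaches the browser and the search engines.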
So that’s the gist of it. The HTML file itself really becomes the database editing interface – it’s the bridge between the user and the database. The user can manage the file however he or she feels like it. At a certain interval (or on demand), the files get converted into database records which are awfully easy to query, manipulate, and display.
(Note that anything I said here about database tables goes the same for search engine indexers. A search engine like Swish-E could do pretty much everything I’ve described and it’s monstrously fast. Running on a 2.4GHz P4, Swish-E indexes all 4,600 HTML files on this site in eight seconds of CPU time. See this post.)
I envision a simple Web interface where the site admin can log in, then:
traverse the HTML folder structure and view files
re-index individual files or entire folders on-demand
kick off a full-scale index of the entire site
browse the logged meta
run test SQL
see the results – including error reports – of previous crawls
specify headers, footers, stylesheets, and submenus for various URL patterns
Of course, this system only works if the users are managing their files via an HTML editor. But I think a lot of users could, and certainly most Web developers. I think there’s a fair number of situations where it could work very well.
And yes, this is simplistic. But it really bridges the gap between a big stack of HTML files and full-blown content management. Call it middle ground.