By Deane Barker | January 24, 2010 | 4 Comments
I’m going to try to impugn one of the great concepts of content management: metadata. I’m going to argue that in the world of Web content management (WCM), it doesn’t really exist. Well, it might, but if it does, it’s awfully slippery to define and defining it doesn’t give you much value anyway.
Classically, “metadata” is “data about data.” People love that definition. Every time I see “metadata” defined anywhere, it’s quickly followed by that phrase (much like I just did right there).
The idea is that metadata describes another piece of data. So, the metadata is not the data, it’s just a description of the data. However, this raises the obvious question: what is “the data”? How do we sort out what is metadata (second order data) and what is data proper (first order data)? It’s harder than you think.
I propose that there are two major theories of metadata – two major schools of thought people invoke when they say, “this is metadata.”
The Geography Theory says that metadata is such because it’s stored “somewhere else” than data proper.
In some situations, this is obvious. In particular, the current concept of metadata in content management came out of the document management space (where most of the *CM spaces originated).
With document management, the core object under management is some binary file, like a Word document. This is clearly “the data.” And in most document management systems, you could “decorate” this data with additional data to describe the file, be it a category, author, status, whatever. And today, inside Word, you can go to “Properties” and add information, and this is logically metadata because the actual words in the document are “the data.”
So, in situations where you have a clear core of “data” and the information about this data is “somewhere else” (see how this is getting weird already?), then it’s pretty clear.
A lot of this geography was dictated by format. Back in the “golden age” of document management, applications like Word couldn’t store extra data like this, so the data was in the document management system instead (“somewhere else”).
Early versions of Ektron had the same problem – back then, Ektron only managed HTML content, rather than more structured things like XML. So, Ektron had a tab called “Metadata” for you to store other information that you couldn’t somehow embed in HTML. This tab still exists, even though with later versions of Ektron, you can put most of this information directly inside XML-based content. (What gets odd is that the datatypes differ between what’s on the “metadata” tab and what you can put in the XML, which sometimes forces you to put something under “metadata” when you’d rather just put it over with the core data.)
In other systems – especially Web content management systems – this distinction between metadata and core data breaks down. In EPiServer, for instance, there is no concept of content being in one place or another – all properties of a page are in the same “place” (under the same “interface umbrella,” if you will), so it’s all just data. Nowhere can I say, “this is data…and this is metadata…” etc.
This situation is very common in WCM. There is no “somewhere else.” All the data relating to a logical piece of content is stored and administrated together, which completely negates The Geography Theory.
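To make the contrast concrete, here is a minimal sketch of the two storage shapes. Every name in it (`ManagedDocument`, `WebPage`, `metadata_by_geography`, the sample values) is made up for illustration and is not taken from Ektron, EPiServer, or any other real system.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any


@dataclass
class ManagedDocument:
    """Document-management shape: an opaque file plus a separate record about it."""
    file_bytes: bytes                                        # "the data"
    metadata: dict[str, Any] = field(default_factory=dict)   # stored "somewhere else"


@dataclass
class WebPage:
    """WCM shape: one flat bag of properties, with no second location."""
    properties: dict[str, Any] = field(default_factory=dict)


def metadata_by_geography(item: Any) -> set[str]:
    """Names of fields stored apart from the core data, if such a place exists."""
    return set(getattr(item, "metadata", {}))


# Hypothetical sample content.
doc = ManagedDocument(
    file_bytes=b"...binary Word file...",
    metadata={"author": "someone", "status": "draft"},
)
page = WebPage(properties={
    "title": "A page",
    "body": "<p>Some text</p>",
    "author": "someone",
    "status": "draft",
})

print(metadata_by_geography(doc))    # the two decorated fields
print(metadata_by_geography(page))   # empty: there is no "somewhere else"
```

The same author and status values exist in both cases. Only the first shape gives the Geography Theory anything to point at.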
The Visibility Theory says that metadata is data that’s used for some purpose other than publishing to a consumer.
Content has “publishable” information, which is data we intend to push to the consumer – the title of a news article is an obvious example. This is the data proper.
But what about data that’s for administrative purposes only? One of my clients was just asking me yesterday about “metadata” to help search for content on the admin side of the site. They wanted to be able to tag or otherwise identify pages so they could find certain pages later among hundreds of others. This is information that would never be published to the end user.
Similarly, at Gilbane Boston last month, I took a question from a woman who wanted to use a taxonomy system to categorize the quality and review state of various content. This is very much information that will never be published – you don’t want your consumers to see your category label of “really crappy stuff I wrote after an eight-martini bender,” after all.
Both of these are perfectly reasonable endeavors, but do they define “metadata,” as opposed to data proper which is published to the end user? If we include explicitly defined information like this as metadata, do we also include systemic information? Is the Published Date considered metadata? What about the applied permission set?
In most systems, there’s no way to really define what data is going to be rendered to the consumer in the presentation layer. Maybe your template will output Published Date, but maybe it won’t, and there’s little way for the system to know that in most cases (few systems have any reason to parse their own presentation templates).
Furthermore, most WCM systems don’t really care (for lack of a better word) if you output a specific piece of the data to the end user. They manage, store, and treat all content data the same way – if you choose to output Datum X on your page, that’s up to you. A WCM system has yet to ask me why I’m storing any particular piece of data.
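Here is a rough sketch of why the Visibility Theory is so slippery. It assumes a toy content item and templates written as plain Python format strings; the field names, the templates, and the helpers `fields_used` and `metadata_by_visibility` are all hypothetical, and a real WCM templating layer would be far harder to inspect than this.

```python
from string import Formatter

# A hypothetical content item: the system stores every field the same way.
content = {
    "title": "City council approves budget",
    "body": "The council voted on Tuesday.",
    "published_date": "2010-01-24",
    "review_state": "needs work",
}

# Two hypothetical presentation templates.
article_template = "<h1>{title}</h1><p>{body}</p>"
archive_template = "<li>{published_date}: {title}</li>"


def fields_used(template: str) -> set[str]:
    """Names of the fields a format-string template would output."""
    return {name for _, name, _, _ in Formatter().parse(template) if name}


def metadata_by_visibility(item: dict, templates: list[str]) -> set[str]:
    """Fields that no template publishes: 'metadata' under the Visibility Theory."""
    published: set[str] = set()
    for template in templates:
        published |= fields_used(template)
    return set(item) - published


# With only the article template, Published Date counts as "metadata".
print(metadata_by_visibility(content, [article_template]))
# Add the archive template and Published Date becomes "data" again.
print(metadata_by_visibility(content, [article_template, archive_template]))
```

Nothing about the content changed between the two calls. Under this theory, whether a field is “metadata” depends on the templates, which is exactly the part the system usually knows nothing about.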
So, in the end, when talking about WCM, I think that the use of the term “metadata” can really muddy the waters, especially for people new to the field. Unless you explicitly acknowledge one of the theories above and explain that “this is the operative definition we’re going to use for ‘metadata’,” it’s easy to get people confused by it.
But even if you do define this, and everyone knows what you’re talking about, what have you gained? In a WCM system where all pieces of data for a logical piece of content are jumbled in together, differentiating between what is “data” and what is “metadata” really has little practical value.
11 years ago, when I worked on a website for categorizing and storing policy documents, we struggled with this issue. We built a WCM system and wanted to show only certain data (metadata) and keep the files (PDF) in another storage system. The trick for us was that we actually created all source documents in SGML, and PDF was just a display format. In the end, we dropped the idea of metadata and just showed key elements from our structured document and ignored the presentation file.
I agree with you. Unless you are dealing with a system that is storing files that have no way of describing themselves, there really is no metadata.
I usually go with the visibility theory, but openly acknowledge that it’s somewhat arbitrary. If you want to go completely crazy, just start with “When is it data, and when is it content?”
So let’s get completely crazy, it’s carnival season anyway…
Basically, I agree with your finding (metadata is a useless concept in WCM), and let me add: not just in WCM. So here comes more stuff to stir up confusion:
If the content you manage is expressed by xml document instances, what are the xml schema documents describing the content types?
With our onion.net CMS (frequently used for customer facing websites so call it WCM if you like) the content types of any individual project are defined by xml schema documents. Actually, you can edit the xml schema at any time.
The xml schema is stored in the same database with other content, so it is data according to the geography theory.
The xml schema is stuff only relevant to information architects. Thus, it’s never published to consumers, and even the average editor won’t ever see it. It must be metadata according to the visibility theory.
By the way: the same is true with documents describing xsl-transformations.
Seems like either we have a nice “closure problem,” or the definitions and the concept of metadata versus data are not adequate in this context. I advocate the latter.
My personal conclusion: content management is about the management of content; and content is a more general concept than data and/or metadata.
I believe that most of the confusion is generated by marketing-driven classifications, which do not clearly differentiate the product space. There are 1000++ systems, which are regarded more or less as (W)(E)C(D)M(S) – please tick appropriate – but we are short of clear definitions. We all want to drive the E-class, don’t we?
Better concepts are needed in our industry to help customers orient themselves. A vinyl record is not a CD is not a Philips cassette, is not an mp3-download, even if they all contain Highway 61 Revisited.
What makes it “meta” is that it’s used for some purpose other than just reading. Typically this means it’s used to share content, or apply particular structures or workflows. I agree that in terms of the technology there’s virtually no difference, except that the underlying CMS schema can be structured to deliver metadata more quickly than the rest of the data: i.e. to generate navigation, you don’t need the full article, you just need its content type, date, subject, title, etc. People get way too tied up in this kind of thing: just think about it all as content, who or what’s going to use it, and how you can make it useful.