Just what is metadata, anyway?

By Deane Barker on June 21, 2009

Content management authors and consultants are obsessed with “metadata.”  You “add metadata” to your content, apparently to describe it and make it easier to organize, or something.  You have “metadata” this and “metadata” that.

But here’s the thing: what is “metadata”?  And how is it different from “data”?

When you first start learning about content management and start reading books written by people like Rockley and Boiko, you keep hearing about “metadata.”  You get confused about this.  What is it?  How is it different from just good ol’ data?

I examined this question in the middle of a session at Web Content 2009.  I don’t remember the session, but I remember asking a question which launched into a discussion of just what metadata was.  After this, I exchanged a couple of Skype messages with Seth about it, and I think I’ve come to a distinction.

Metadata is data added around content that cannot otherwise be structured.

Content management systems have their roots in document management systems.  For these systems, the concept of “metadata” makes sense.  The binary file – Word document, or whatever – was the “data.”  You couldn’t add any structured information to this, so you tacked on “metadata” around it to help describe it.

I find another example of this with Ektron.  Way back when, Ektron didn’t do structured content (let me emphasize that this was a long time ago).  So, even today, when you edit content, you have a “metadata” tab that you can configure to contain certain fields of structured information.  You did this because you couldn’t add structured information to the HTML – HTML just doesn’t hold that kind of information.  So you have the “data” as the HTML, and the “metadata” as the structured information.

So, in these cases, we have content that cannot be structured – a binary file or raw HTML.  To add specific, granular information to this, we have to have some other framework for it.  Enter metadata.
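To make that model concrete, here is a minimal Python sketch of an opaque asset with fields tacked on around it. The names `StoredAsset` and `find_by`, and the sample values, are hypothetical and invented for illustration; they are not taken from Ektron or any real document management system.

```python
from dataclasses import dataclass, field


@dataclass
class StoredAsset:
    """Hypothetical document-management record: an opaque file plus side-car fields."""
    payload: bytes  # the "data": a Word file, raw HTML, or whatever
    metadata: dict[str, str] = field(default_factory=dict)  # the "metadata"


def find_by(assets: list[StoredAsset], key: str, value: str) -> list[StoredAsset]:
    """Query only the side-car fields; the payload itself is never parsed."""
    return [a for a in assets if a.metadata.get(key) == value]


# Illustrative sample records (made-up values).
assets = [
    StoredAsset(b"\xd0\xcf\x11\xe0", {"author": "alice", "topic": "strategy"}),
    StoredAsset(b"<p>Some raw HTML</p>", {"author": "bob", "topic": "cms"}),
]

matches = find_by(assets, "author", "bob")
print(len(matches))
```

The point of the sketch is that the system can only answer questions about what sits in `metadata`, because `payload` is just bytes to it.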

So, is metadata still relevant?  Not in most cases.  In mainstream Web content management these days, there’s no need to claim to have separate “metadata” because you can usually structure your content now.  Data is data.  Content is designed to be structured, and metadata would just be pieces of structured data, just like your page title or your page body.
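Here is a rough Python sketch of what “data is data” looks like in a structured content type. The `Article` class, its field names, and the keyword values are assumptions made up for this example, not the schema of any particular CMS.

```python
from dataclasses import dataclass, fields
from datetime import date


@dataclass
class Article:
    """Hypothetical structured content type: every field is a peer of every other."""
    title: str
    body: str
    author: str
    published: date
    keywords: list[str]


post = Article(
    title="Just what is metadata, anyway?",
    body="Placeholder body text.",
    author="Deane Barker",
    published=date(2009, 6, 21),
    keywords=["metadata", "content modeling"],  # illustrative values
)

# Nothing in the type marks one field as "content" and another as "metadata".
field_names = [f.name for f in fields(post)]
print(field_names)
```

Under this assumption, the title, body, author, and date are all declared and queried the same way, so any line between “data” and “metadata” is one you draw yourself.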

Why is this important?  Because the concept of “metadata” is confusing for people who have never had to use it.  There are people who have never been involved in pure document management situations where it made semantic sense to say “data” was one thing and “metadata” was another, so this concept doesn’t make sense.  In 99% of WCMS situations, there’s no difference.  Data is data.

I remember way back when I was reading Rockley and Boiko for the first time, I kept thinking, “What is this ‘metadata’ they keep talking about, and why haven’t I ever used this?”  And with Ektron, I never understood what the “metadata” tab was all about since I started using Ektron with structured XML right from the get-go.

In most cases, metadata is just data.  If someone disagrees, I’d love to hear an argument to the contrary.

Comments (6)

Graham says:

That’s an absurdly narrow definition of metadata. If you have a blog entry, the timestamp and author name are still metadata whether they’re stored in the same place as the text of the entry or in a separate structure alongside it.

The bottom line is that a blog entry or a photo or most other kinds of content can happily exist without knowing who created it or when it was created. This extra information is metadata – data about the data.

Deane says:

You make an interesting point, and one I brought up in the debate at the conference.

The book “Using Drupal” promotes this test to determine when something should be defined via taxonomy or not:

If you can take the piece of information away, and the content still makes sense, then this can be defined via taxonomy

It’s a neat, elegant definition...but I still reject it. The definition is too fine. In your example, what you define as “extra information” or “metadata”...is still just data.

Where do you draw the line between what’s “actual content” and what’s not? Is the actual content the body of the blog post? So, is the title metadata?

The line gets very, very blurry, and, in the end, what does it matter? How does calling something “data” and some other thing “metadata” really change anything at all?

Benxamin says:

I disagree that a timestamp on a blog post is metadata. The publish time(stamp) is so integral to the very function of a blog that sorting by time is the preferred default. In fact, I haven’t seen an implementation of a CMS where articles weren’t sorted by time, at least initially. Removing the timestamp from a blog entry would not make sense within the context of a blog.

Seth Gottlieb says:

I think metadata is only relevant for assets that cannot themselves contain structure (like a MS Word file or an image or a video). Structured and semi-structured content contains all the information that it needs.

One thing that is truly “meta” in the world of web content management is all the distributed discussion that happens around the content. This post and all of its elements (title, author, publication date, publication status, keywords...) are all part of the article. But what about this comment? What about the link on Delicious, the comment on Friendfeed, or the links on Twitter? Now that’s meta.

Jennifer says:

It sounds as if they are just notes. In my experience, metadata provides a brief description of what the content is about, so the DMS doesn’t have to parse every bit of content, which could slow it significantly if there are a lot of documents.

John says:

I like the definition of meta-data as “data about data.” So yes, metadata “is just” data, but the same way a square “is just” a polygon.

What separates metadata from content depends on the purpose of people reading the content.

Maybe you can think of it as “what people look for” vs. “how they look for it.” In that sense, contra Benxamin above, the date is explicitly metadata because no one wants to “read” the date, except to help them locate the actual content they are interested in.