A Business Information Publication Standard and Google’s Power to Make It Happen

By Deane Barker on August 30, 2004

It’s time we extend the robots.txt concept to information about businesses.

First, let’s take a quick detour into robots.txt.

In order to tell a search engine how to spider a Web site (or not), webmasters can stick a text file called “robots.txt” in their root directory with information spiders can use. This works because it’s useful and it’s ridiculously simple — everyone sticks the same file, named the same thing, in the same place. It’s so simple, it can’t NOT work.
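To make the detour concrete, here is a minimal sketch of how a spider might read one of these files, using Python's standard `urllib.robotparser` module. The robots.txt content, the `ExampleBot` name, and the example.com URLs are all made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content: every spider is asked to stay out of /private/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_parser(robots_text: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# "ExampleBot" is a hypothetical spider name.
print(parser.can_fetch("ExampleBot", "http://example.com/index.html"))
print(parser.can_fetch("ExampleBot", "http://example.com/private/notes.html"))
```

The spider never has to be told where the rules live. It asks for the same file name at the same location on every site.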

MT-Blacklist does the same thing. You can put your blacklist in a text file called “blacklist.txt” in the root of your site so people can see what you’re blocking.

For the sake of standards, it is recommended that the file be named blacklist.txt and reside in the root directory of your website. The possible network effect here is certainly delicious.

This isn’t complicated — stick the same info in the same place and people will know where to get it.

Okay, back to the point —

Let’s make up an XML spec for information about your business. Like this:

  Deane's House of Pancakes

1600 Pennsylvania Ave

  We make round, fluffy crap.


Obviously, this is absurdly simple, but you get the idea. You could have fields for your business phone number, fax number, general email, directions on how to get to your office, stock ticker symbol, customer service phone number, etc. Essentially, anything anyone would want to know about your business that they would otherwise have to (1) look up on your Web site, or (2) call your receptionist to find out.
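As a rough sketch of how little work it would take to consume a file like this, the Python below parses one with the standard `xml.etree.ElementTree` module. No spec exists, so the element names (`business`, `name`, `address`, `description`) are assumptions invented for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical info.xml. The element names are assumptions for illustration only.
INFO_XML = """\
<business>
  <name>Deane's House of Pancakes</name>
  <address>1600 Pennsylvania Ave</address>
  <description>We make round, fluffy crap.</description>
</business>
"""


def parse_info(xml_text: str) -> dict:
    """Flatten the top-level child elements into a field-name -> text mapping."""
    root = ET.fromstring(xml_text)
    return {child.tag: (child.text or "").strip() for child in root}


info = parse_info(INFO_XML)
print(info["name"])
print(info["address"])
```

Adding a phone number or a ticker symbol would just mean another child element, and a parser written this way would pick it up without changes.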

Now, let’s all put this file in the root of our Web sites and call it “info.xml.” That way we all know where it is, and we can all retrieve it. Every business now has a common URL pattern where a program can find easily digestible information about the business. It’s not hard to imagine what we could do with this.
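The "common URL pattern" part is simple enough to sketch in a few lines of Python. The helper name `info_url` and the choice to default to plain `http` are assumptions for this example.

```python
from urllib.parse import urlsplit, urlunsplit

INFO_FILENAME = "info.xml"  # the file name proposed above


def info_url(site: str) -> str:
    """Return the info.xml URL for a bare domain or for any URL on that site."""
    if "://" not in site:
        # Assumption: fall back to plain http when no scheme is given.
        site = "http://" + site
    parts = urlsplit(site)
    # Keep the scheme and host; discard the path, query string, and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/" + INFO_FILENAME, "", ""))


print(info_url("example.com"))
print(info_url("https://example.com/about/team.html?ref=home"))
```

Whatever page a program starts from, it ends up at the same address in the root of the site.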

For instance, Outlook could parse the domain name of the email address of all your contacts, go looking for this file for each one, then store the information with the contact (refreshing it every 30 days or so).
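Here is a hedged sketch of the two small pieces an address book would need for that: pulling the domain out of an email address, and deciding when a stored copy is stale. The function names and the exact 30-day cutoff are assumptions for this example, not how Outlook or any other mail client actually works.

```python
from datetime import datetime, timedelta
from typing import Optional

REFRESH_INTERVAL = timedelta(days=30)  # "every 30 days or so"


def domain_from_email(address: str) -> str:
    """Return the lowercased domain part of an email address."""
    if "@" not in address:
        raise ValueError(f"not an email address: {address!r}")
    _, _, domain = address.rpartition("@")
    return domain.strip().lower()


def needs_refresh(last_fetched: Optional[datetime], now: datetime) -> bool:
    """True if the info was never fetched or the stored copy is 30+ days old."""
    return last_fetched is None or now - last_fetched >= REFRESH_INTERVAL


now = datetime(2004, 8, 30)
print(domain_from_email("Jane.Doe@Example.com"))
print(needs_refresh(datetime(2004, 7, 1), now))
```

The domain goes in, the well-known file location comes out, and the contact record quietly keeps itself current.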

Online white and yellow pages could have a field day with it — you just give them your URL, and all your information is self-updating. Search engines could present this information alongside search results for your company. Etc.

Of course, this only works if everyone does it. And here, my friends, is the one, single thing that would have to happen for everyone to do it: Google adopts it. That’s it. If Google announced tomorrow that they were going to do something like this, and released the spec for it, we’d see info.xml files start to hit the Web within a few hours. We’d have massive saturation within a month.

Google is already pushing beyond search with localization results. They’re nailing down addresses of sites they visit, then presenting them in graduated radii from the city center.

Why not eliminate the parsing step and just ask people to put their actual address in a file in a common location? And while they’re at it, have them put a bunch more information there as well. Once search engines start spidering and parsing this stuff, it’s amazing the level of detail and accuracy you could get for online directories and other business information sources.

And, again, the only thing standing between this idea and reality is Google. Google is such a juggernaut that they could, in the words of Jean-Luc Picard, “make it so” just by announcing that they wanted it to happen. Companies would fall all over themselves to deliver it. Like Microsoft, Google is in a position to drive standards simply by virtue of its position.

So, Google, snap to it. The world is waiting.



  1. That’s a good article, but I don’t agree with the argument that fishing for a file is bad because there’s no way to know if a file is there until you ask.

    There’s no way to know if anything is anywhere until you ask. This author claims that 404s are a Very Bad Thing at their core. I disagree — a 404 is a valid response to an implied question.

    A 404 is bad if there’s supposed to be something there, but if it’s optional whether something is there or not, then a 404 is a legitimate response: the client asks, “Is this there?” and the server responds, via 404, “No, it’s not.” This is perfectly reasonable.

    The author is upset because he gets so many “hits” to a robots.txt file that’s not there. Why is this a bad thing? Each one of those requests is a question, and each 404 is an answer. Request-response: it’s what the Web was built on. If he’s that concerned about bandwidth, he can reconfigure the Web server to just send back a 404 header or something, but you need to respond with something or else the whole idea falls apart.

    He does make a good point about subfolders as such, though. I like the question, “What is a Web site?” Is it a domain name, or is it a base URL? Can there be more than one Web site (more than one business) represented at the same domain name?

    Food for thought.

  2. I don’t think the 404s are such a big issue so much as the loss of control over namespace is. The whole point of the domain name system and the hierarchical way that sites are organised is that each domain gets its own namespace which the owner has full control over. robots.txt (and similar schemes) violate this ownership – suddenly part of your namespace has been stolen from you. You can’t opt out, because people will assume that anything at that URL is in the format they are expecting – so your namespace has been co-opted without your permission. On top of that, the idea simply doesn’t scale – as Joe points out, there’s no central registry of these things and the more of them there are the harder it is to keep track of them and manage their deployment.

  3. “so your namespace has been co-opted without your permission.”

    Such the nature of these things. To make this work, that has to be allowed. It’s a small price to pay, and I’m personally more than willing to pay it.

    “…there’s no central registry of these things and the more of them there are the harder it is to keep track of them and manage their deployment.”

    The only one you have to manage is your own. That’s like saying that there’s no central registry of Web sites.

  4. I’m with you on this, Deane. robots.txt (and others like what you’re suggesting) doesn’t violate ownership any more than having a placard near the front door of a house or business with the street address on it.

  5. “The only one you have to manage is your own. That’s like saying that there’s no central registry of Web sites.”

    The only one?

    robots.txt – plain text, special format, in the root of the domain
    favicon.ico – particular image format in the root of the domain
    p3p.xml – a particular XML format in the /w3c/ folder
    info.xml – yet another different XML format in the root folder
    a.b – another file with another format that you haven’t heard of yet in the /a.b/ folder
    …

    A central registry of web sites is irrelevant – when you can’t place a file on your website because you aren’t sure whether or not its name is significant in some metadata scheme, and there is no central list of metadata filenames, that’s a problem.

    “Such the nature of these things. To make this work, that has to be allowed.” Not only does it not have to be allowed – see Tim Berners-Lee’s suggestion, for example – but it’s not even desirable.

    “The author is upset because he gets so many “hits” to a robots.txt file that’s not there. Why is this a bad thing?” Because it’s a poor design and expensive in time, requests and bandwidth.

    GET /favicon.ico HTTP/1.1 404
    GET /robots.txt HTTP/1.1 404
    GET /info.xml HTTP/1.1 404
    GET /w3c/p3p.xml HTTP/1.1 404

    and one more for every new idea anyone tries to implement.

    compared to:

    Metadata:www.wherever.com
    No metadata supported.

    That’s it, now and for all future protocols. One request, and you return which ones you support.

  6. While I appreciate all your arguments, the bottom line is that to require central registration of this stuff is to doom it to failure. Here’s what I want:

    1. The ability for the information to be discovered via an HTTP call.

    2. No requirement for this information to be registered or otherwise advertised outside the province of the Web site to which it applies.

    The robots.txt model works for this. If you have a better idea that meets my requirements, let’s hear it. If not, then my endorsement of the robots protocol remains.
