Validating HTML

By Deane Barker on June 5, 2004

I’ve been playing around with the W3C HTML validator, and I’ve found, sadly, that there’s no easy way to get this page to validate. There were some problems that I fixed, but when I try to validate against 4.01 Transitional, I get about 50 errors related to the use of “&” in URLs.

Apparently you’re supposed to use the HTML entity for the ampersand (“&amp;”) even in URLs. But since this entity isn’t present in the URL in the address bar of the browser, and that’s where you generally copy the URL from, how are you supposed to convert these without manually picking through every URL you use? You could try to get funky with regular expressions, but I can’t imagine that would work perfectly in every case.
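
As a rough illustration of the regular-expression approach, here is a minimal Python sketch. The function name `escape_bare_ampersands` and the example URL are made up for this illustration. It replaces any “&” that does not already begin a character reference, so running it twice is harmless:

```python
import re

# Matches an "&" that does NOT already start a character reference
# such as &amp;, &#38; or &#x26;.
BARE_AMP = re.compile(r"&(?![A-Za-z][A-Za-z0-9]*;|#[0-9]+;|#[xX][0-9A-Fa-f]+;)")


def escape_bare_ampersands(text: str) -> str:
    """Replace unescaped ampersands with &amp;, leaving existing references alone."""
    return BARE_AMP.sub("&amp;", text)


if __name__ == "__main__":
    # Hypothetical URL, as copied from a browser address bar.
    url = "http://example.com/search?q=html&page=2&lang=en"
    print(escape_bare_ampersands(url))
```

This also shows why the approach cannot be perfect: a query string such as `?a=1&copy;=2` contains something that looks like an entity, and the pattern has no way to tell that it was meant as a parameter name, so it is left untouched.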

This brings up a larger point: you can’t really expect to validate a site where a large part of the page’s HTML is provided by people other than the original Web developer. Every entry on this page, comprising the entire middle section, can be entered by someone else, and how can I make sure they’re entering valid HTML markup?

This is where HTML Tidy integration will work very well in PHP 5. Using this tool, you can validate HTML that people enter before you store it in the database, or before you output it. You can make sure all tags are closed, all tags match, etc. so perhaps you can hope for some sort of valid markup.
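
To give a feel for the kind of check involved, here is a small Python sketch that uses the standard library’s `html.parser` as a stand-in. It is not HTML Tidy and repairs nothing; it only reports tags that are unclosed or mismatched. The names `TagBalanceChecker` and `find_tag_errors` are invented for this example, and the list of void elements is a partial one chosen for illustration:

```python
from html.parser import HTMLParser

# Elements that never take a closing tag in HTML (partial list).
VOID_TAGS = {"area", "base", "br", "col", "hr", "img", "input", "link", "meta", "param"}


class TagBalanceChecker(HTMLParser):
    """Tracks open tags on a stack and records mismatched closing tags."""

    def __init__(self):
        super().__init__()
        self.open_tags = []
        self.errors = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()
        else:
            self.errors.append(f"unexpected </{tag}>")


def find_tag_errors(fragment):
    """Return a list of problems found in a user-submitted HTML fragment."""
    checker = TagBalanceChecker()
    checker.feed(fragment)
    checker.close()
    return checker.errors + [f"unclosed <{tag}>" for tag in checker.open_tags]
```

A fragment could be rejected, or passed to a real cleanup tool, whenever `find_tag_errors` returns a non-empty list. Note that this sketch is stricter than HTML 4.01, which lets some end tags (such as `</p>` and `</li>`) be omitted.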

But, in an even larger sense, does validation matter much? I’ve never gotten any comment from anyone about the validation of this site. So what if I’m throwing 50 errors because of ampersands in URLs? Can someone provide me with a valid (excuse the pun) reason why this matters?

I understand problems can occur from gross misuse of the HTML spec, but are all validation errors created equal? My apparent misuse of ampersands has got to rank pretty low on the sin list.


Comments

  1. Yes, ampersands must be written as & in HTML (this includes attribute values). This is to differentiate them from the beginning of an entity. Naturally most browsers can cope with a lonely & in a URL.

    However I always make sure to write &. For one thing, I don’t want 50 trivial errors to hide the 51st real error which might matter, and the HTML validator is useful in finding those too.

    With a content management system I wrote some years ago (Onpage2.com), all user input would be validated by HTML Tidy before it was stored as an XML snippet in the SQL database. You might not think too much of HTML validation, but if you use XML (to have it be XHTML in the end), validation is absolutely necessary. You cannot handle XML objects which don’t validate, really.

    Talking about validation, just found some bugs in my current template. Bugs always reintroduce themselves if you don’t constantly validate!

    Good luck with your ampersands!

  2. Just noticed you don’t correctly escape ampersands in comments either, so some of the meaning above got lost. You should always convert all “&” entered by the user to “&amp;”…

  3. There is a plugin for MT that will perform the ‘&’ to ‘&amp;’ conversion for you, or you could hack one in yourself.

  4. That’s why the W3C recommends using semicolons rather than ampersands to separate key/value pairs in the query string. Quite a simple solution, really.
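
Two of the suggestions in the comments can be sketched in a few lines of Python: escaping whatever a user types before printing it into a page, and building a query string with “;” as the separator. The helper names `render_comment` and `build_query` are hypothetical, and the second one assumes the receiving server accepts semicolon separators, which is not universal:

```python
from html import escape
from urllib.parse import quote_plus


def render_comment(text):
    """Escape &, <, > and quotes so user input is shown literally, not parsed as HTML."""
    return "<p>" + escape(text) + "</p>"


def build_query(params, sep=";"):
    """Join URL-encoded key/value pairs with `sep` (";" by default) instead of "&"."""
    return sep.join(f"{quote_plus(key)}={quote_plus(value)}" for key, value in params)


if __name__ == "__main__":
    print(render_comment("Write & as &amp; in HTML"))
    print("http://example.com/search?" + build_query([("q", "html tidy"), ("page", "2")]))
```

With escaping applied on output, a commenter who types “&amp;” sees exactly those five characters in the rendered page, because the leading “&” is itself escaped.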
