Data Globbing with MySQL Regex

By Deane Barker on July 13, 2004

As I become a more experienced developer, I’m learning when you should and shouldn’t break the rules. While following every rule of programming and data modeling is wonderful, sometimes you need to bend the rules for the sake of simplicity and expediency.

Always remember, an app in the hand is worth a thousand on the white board.

This being the case, lately I’ve been known to “glob” up data in database fields. Yes, I know this breaks the first normal form – that of atomicity – but there are times when doing it right would involve three more queries, two more database tables, another UI screen, etc. Often it makes the cure worse than the disease.

For instance, consider this little XML document as the contents of the “children” field for one of the records in my “church_attender” table:


Now if I never wanted to search for individual children, I would make no excuses for this. It saves us a database table, a join, and a ton of complexity in the interface. Life is good.

Searching a globbed up field is a problem, though. We alluded to it in this post when we said:

However, the problem is that the XML field is a black box that – on most database platforms – you can’t look inside. What if you want a list of articles written by a particular author? Well, you need to use SQL to get all the XML back, spin that collection, XPath into every single one to find the value author of the author node, then keep that record it matches.

So what if I want to find a person with a child named Gabrielle? Some databases (Oracle, for one), will let you do something like this:

SELECT * FROM church_attender
  WHERE XPATH(children,'/child/first_name') = 'Gabrielle' 

That’d be great, but I don’t have Oracle. However, given our experience this week with MySQL regular expressions, how unacceptable would this be:

SELECT * FROM church_attender
  WHERE children LIKE '%Gabrielle%'
  AND children RLIKE '<children>.*<child>.*<first_name> [line break]
  Gabrielle </first_name>.*</child>.*</children>' 

(Note that there are some extra spaces in there just so the lines would wrap.)

Yes, yes, I know the database Gods would frown on this, but given the enormous amount of complexity it would save us, is it acceptable? Does the good outweigh the bad?

Fishing for opinions here. Let’s hear them.

Comments (11)

Simon Willison says:

This highlights the main reason that storing XML in a database is a bad idea. Instead of hacking around the limitations of XML, the smart solution would be to properly normalise the data and spin off a separate children table. XML is great for transferring data between apps, and great for document markup, but it’s a lousy data storage format when you could be making full use of a relational database instead.

Deane says:

“the smart solution would be to properly normalise the data and spin off a separate children table.”

Oh, I agree. That’s the RIGHT way to do it. But there’s a lot of extra complexity involved there. The idea is to break the rules enough to strike a good balance between what’s right and what’s fast.

Simon Willison says:

I disagree that it’s extra complexity – after all, that’s how relational databases are meant to be used. Additionally, queries run against properly normalised tables will run a great deal faster than queries that use hacks to search against XML. Your RLIKE query forces the database to read in every single row in the table and run a regular expression over a column for it. With a separate children table, especially if it’s properly indexed, the database will be able to calculate a much more efficient query plan which avoids full table scans and makes the smartest use of the data. From a performance point of view a proper relational structure is a big win.

If you’re talking about speed of development, you need to bare in mind Larry Wall’s three virtues of a progammer ( ) - in particular Laziness. While the XML solution with the regular expressions may seem easier to develop, in the long run you may find yourself wanting to run other queries against the data which can’t be done using regular expressions (or you may find that your table’s have grown to the point where the poor performance of a regexp table scan requires a rethink of how you are doing things). The lazy solution then is to bite the bullet and normalise the tables now – it may well end up saving you a whole bundle of time later on.

Jonathan Hollin says:

I agree with Simon, normalise the tables now.

Having said that...

SELECT * FROM church_attender
  WHERE children LIKE '%Gabrielle%'
  AND children RLIKE '.*.* Gabrielle .*.*'"

... is a pretty cool hack – a clever short-term fix!

James M says:

Congratulations for owning up to something that, as programmers (read – hackers) we have all done at one time or another.

This is indeed a ‘cool hack’, however I shudder to think of the performance bottleneck this query would cause on a substantial dataset.

Joe says:

Well, the nice thing about XML is that, if he decides later that this should be fully normalized, it would be a pretty small script that could scan that table, read that XML, and pull it out into a normalized structure. It would depend on how well you’ve isolated your database layer as to how painful that refactor would be. Inline SQL in a PHP site? Ouch! Database access objects? No sweat.

Nate Thornton says:

I agree that this may not be the best example of putting XML into a database, but there are circumstances where storing XML in a database to pull out would be WAY faster than any normalized set of tables anyone could create. For example, we have a couple of fields in one of our databases that is in XML format. We pull this xml out of the field in the database and create dynamic forms from the information contained in the xml. It’s one pull of one field in the database, no regex’s, no anything of the sort. To do the same thing in database tables would take approximately 4-5 tables, that all reference each other, and it would end up pulling a lot more data out of the database to be able to get the same information that the simple xml structure can contain. (*Think of a windows form with 200 controls on it that is dynamically created at runtime). Anyways, it’s always a good idea to look at other ways of doing something. Just because there is a “Standard”, doesn’t mean it’s the best for every circumstance, especially when it comes to speed.

Ajeet Phadnis says:

I have used mysql to store the xml data. The way I do is to use mix of key columns with blobs. e.g. in SD – order module my schema is : erp_name, ordernr, person, blob. This works fine for me.

Thomas Nicolai says:

Well, I think it will work in both ways and depends on what you wanna do at the end. But there are a lot of advantages to build a schema like Ajeet has done. I’ve done it in the same way, too. It works perfectly and if you wanna use XML in a pretty heavy way (e.g. web services) you have to choose this option of dealing with the XML data. The best way is still to use the short term hack and hope that the mysql developers will integrate XPath.

nought says:

“hope that the mysql developers will integrate XPath”

does it not help that php5 has native XPath functionality? tied into MySQL doesn’t that accomplish the same thing? if it does help, please someone spell out what this allows us to do. doesn’t it allows us to build an object-relation bridge in php? and doesn’t that amount to being able to deploy a php-based (LAMP-based) zope/zodb? in the form of a php api? i dub it zoph.

Thomas Nicolai says:

“does it not help that php5 has native XPath functionality?”

No it does not help us cause how do you wanna search trough 10’000 data records with PHP in an efficent way? This is just not possible and is also not a new feature of PHP5. The SimpleXML queries and all that stuff in PHP5 is just a better and broader native support. Well the key question is wheter you have a native XPath support in your Database application or not! ORACLE HAS AND....


Check this out:

...and therefor this discussion can be closed :-)


Well what do we need more?