I worked on improving the search on this site today. Search has been through a number of iterations. First, I used the basic Movable Type search. But it was slow and I wanted to do some interesting things with search.
So last year, I switched to using a SQL “LIKE” query to return two-tiered results, first from the title and keywords, then from the body and extended body. This has worked really well so far.
However, one thing bothered me about it: LIKE has no concept of word boundaries. This…
WHERE entry_text LIKE '%date%'
…returns matches for “date,” “update,” “datebook,” “candidateship,” etc. So, I changed it a while ago to this syntax:
WHERE CONCAT(' ',entry_text,' ') LIKE CONCAT('% ','date',' %')
Basically, I started looking for the search term with a space on either side. This turned out to be an astoundingly stupid way to do it. What happens when the search term ends a sentence? Or begins the entry? Newlines, periods, and question marks are not spaces. Duh.
Today, however, I found the right way to do it:
WHERE entry_text RLIKE '[[:<_3a_5d_5d_date5b_5b_3a_>:]]'
Those bracket-y things are MySQL’s character classes for word boundaries. So it’s like my “tack a space on either side” method, but it includes anything that’s not a word. Very handy.
(But why don’t I just use MySQL’s full-text indexing, you ask? Because this is a hosted server, so I have to live with all the default MySQL settings. And the default settings exclude any word of less than four characters from being indexed. So three-letter acronyms like PHP, PDF, JSP, etc. wouldn’t be in the index, and for a site like this, that’s kind of a showstopper.)
I also did a little hacking of the search phrase you submit. For instance, searching for…
…will give you different results than if you search for…
Essentially, I tokenize the string but group quoted passages together. So, this…
"four score and seven years ago" abraham lincoln gettysburg
…would get tokenized like this:
 => four score and seven years ago
 => abraham
 => lincoln
 => gettysburg
Finally, I included some stop words to save my database some work. This…
The penguin is the mascot of Linux
…gets reduced to this…
 => penguin
 => mascot
 => linux
So, I’m hoping to shake a few bugs out of it in the next few weeks, then I’ll post the class so anyone who wants to can take a look at it.