Improved Search

By Deane Barker on July 11, 2004

I worked on improving the search on this site today. Search has been through a number of iterations. First, I used the basic Movable Type search. But it was slow and I wanted to do some interesting things with search.

So last year, I switched to using a SQL “LIKE” query to return two-tiered results, first from the title and keywords, then from the body and extended body. This has worked really well so far.

However, one thing bothered me about it: LIKE has no concept of word boundaries. This…

WHERE entry_text LIKE '%date%'

…returns matches for “date,” “update,” “datebook,” “candidateship,” etc. So, I changed it a while ago to this syntax:

WHERE CONCAT(' ',entry_text,' ') LIKE CONCAT('% ','date',' %')

Basically, I started looking for the search term with a space on either side. This turned out to be an astoundingly stupid way to do it. What happens when the search term ends a sentence? Or begins the entry? Newlines, periods, and question marks are not spaces. Duh.

Today, however, I found the right way to do it:

WHERE entry_text RLIKE '[[:<_3a_5d_5d_date5b_5b_3a_>:]]'

Those bracket-y things are MySQL’s character classes for word boundaries. So it’s like my “tack a space on either side” method, but it includes anything that’s not a word. Very handy.

(But why don’t I just use MySQL’s full-text indexing, you ask? Because this is a hosted server, so I have to live with all the default MySQL settings. And the default settings exclude any word of less than four characters from being indexed. So three-letter acronyms like PHP, PDF, JSP, etc. wouldn’t be in the index, and for a site like this, that’s kind of a showstopper.)

I also did a little hacking of the search phrase you submit. For instance, searching for…

windows update

…will give you different results than if you search for…

"windows update"

Essentially, I tokenize the string but group quoted passages together. So, this…

"four score and seven years ago" abraham lincoln gettysburg

…would get tokenized like this:

Array
(
    [0] => four score and seven years ago
    [1] => abraham
    [2] => lincoln
    [3] => gettysburg
)

Finally, I included some stop words to save my database some work. This…

The penguin is the mascot of Linux

…gets reduced to this…

Array
(
    [0] => penguin
    [1] => mascot
    [2] => linux
)

So, I’m hoping to shake a few bugs out of it in the next few weeks, then I’ll post the class so anyone who wants to can take a look at it.

Gadgetopia
What Links Here

Comments

  1. LIKE does indeed have reference of word boundries. where column like ‘text%’ …would match “textblah” and “textyo” but not “yotext”. where column like ‘%text’ …would match “blahtext” and “yotext”, but not “textyo”.

    gilbert.

  2. I have to disagree with you, Gilbert. I’m quite aware of the code you provided, but what does that have to do with word boundaries? If I did this…

    LIKE ‘date’

    …then it’s the same as…

    = ‘date’

    ..so I’m no better off. This…

    LIKE ‘%date’

    …would still hit on ‘update’. This…

    LIKE ‘date%’

    …would still hit on ‘datebook’. This…

    LIKE ‘%date%’

    …would still hit ‘candidateship’. The code I provided, however, is like this:

    LIKE ‘[any non-word character]date[any non-word character]’

    You’re not going to get that functionality using ‘%’.

    Deane

  3. I’ve found an interesting optimization, because MySQL’s regular expression search is pretty slow, you can get a considerable speed up using:

    WHERE entrytext LIKE ‘%date%’ AND entrytext RLIKE ‘[[::]]’

  4. WOW — stunning results from that code tip, Gavin. Query times immedaitely dropped from between 1 – 2 seconds down to less than .05 seconds. There is NO latency now. Fantastic. Thank you.

  5. Just a thought on full-text: you could just check the length of the string being submitted and if it’s less than [amount specified by host, in your case 4] go with the RLIKE solution, otherwise stick with the full-text indexing in MySQL which should be way more efficient.

  6. Antonio, I’ve considered your suggestion before, and I think it’s a good one. But with queries taking less than .05 seconds as it is, I guess I’m not looking to squeeze anymore speed out of it.

  7. Gavin and Deane, quick question if you don’t mind, just to clarify things.

    Say this is my current query:

    SELECT * FROM post WHERE post_body LIKE ‘%dog%’;

    If I add in the RLIKE stuff:

    SELECT * FROM post WHERE postbody LIKE ‘%dog%’ AND postbody RLIKE ‘[[::]]’;

    MySQL doesn’t like it and throws a wobbly. Am I reading your examples correctly? Should that work?

  8. That’s what I tried before, Deane, and it fails.

    Is this command something that only works on the latest version of MySQL? I’m on 3.23.49…

    When I try to count the result I get: Warning: mysqlnumrows(): supplied argument is not a valid MySQL result resource

    But if I remove the RLIKE stuff, it works fine.

    Sorry to turn this into a support post, but I guess if it doesn’t work on my system, maybe we ought to know why and possibly save other people some time :o/

  9. Yeah, I realized after the fact that the comment engine stipped out what it thought was HTML. Sorry.

    I’m on 3.23 as well. Based on that error message, it looks like you’re reading the error back from PHP. You need to find out the specific MySQL error. Do this:

    print mysql_error();

    Or run the SQL from phpMyAdmin. That’ll give you a much better idea of the problem.

  10. Ah, I ran it in PHPMyAdmin, and it gave me the error:

    1139 – Got error ‘invalid character class’ from regexp

    Any ideas, now we know the error?

  11. Any idea why a mysql WHERE clause created by the following code might not work perfectly?

    $where.=’ ( ‘.$field.’ LIKE “‘.$term.'” OR ‘.$field.’ LIKE “% ‘.$term.’%” OR LIKE ” ‘.$term.’%” OR ‘.$field.’ LIKE “‘.$term.’%” ) ‘;

    it works most of the time, but for some reason, when it develops the following WHERE clause, it doesn’t match a block of text it should.

    WHERE clause: ( searchtext LIKE “php4” OR searchtext LIKE “% php4%” OR searchtext LIKE ” php4%” OR searchtext LIKE “php4%” )

    text it should match: requires php4 mysql blah

    it matches ‘mysql’, ‘requires’, ‘req’, ‘blah’, ‘bl’… but not ‘php4’…. does it have to do with the numeral? if so, how do i get around that? it makes no sense to me… perhaps it’s too late…. anyone else run into this issue?

    (note: i would like to avoid the regex queries to the DB if at all possible, due to DB speed concerns.)

  12. Any idea why a mysql WHERE clause created by the following code might not work perfectly?

    $where.=’ ( ‘.$field.’ LIKE “‘.$term.'” OR ‘.$field.’ LIKE “% ‘.$term.’%” OR LIKE ” ‘.$term.’%” OR ‘.$field.’ LIKE “‘.$term.’%” ) ‘;

    it works most of the time, but for some reason, when it develops the following WHERE clause, it doesn’t match a block of text it should.

    WHERE clause: ( searchtext LIKE “php4” OR searchtext LIKE “% php4%” OR searchtext LIKE ” php4%” OR searchtext LIKE “php4%” )

    text it should match: requires php4 mysql blah

    it matches ‘mysql’, ‘requires’, ‘req’, ‘blah’, ‘bl’… but not ‘php4’…. does it have to do with the numeral? if so, how do i get around that? it makes no sense to me… perhaps it’s too late…. anyone else run into this issue?

    (note: i would like to avoid the regex queries to the DB if at all possible, due to DB speed concerns.)

  13. Based on the discussion here I generated a php script for searching MT entries that you guys might find interesting. That is until Tim Appnel releases his XSearch/Plus plugin. It also spits out the probability of the result matching the searched terms based on a simple rubric for scoring words based on title, text, etc.

    I posted the code up on the MT forums here: http://www.movabletype.org/support/index.php?act=ST&f=14&t=45349

    You can see it in action here: http://depository.unfoldedorigami.com/cupboard/

    Thanks!

  14. Are you still planning on posting the class for other people to look at? I’d be interested in seeing exactly how you got this working.

  15. Sure, I guess. I was just looking at it again. It’s a mess, but if you want it, send your email and I’ll send it to you.

  16. 1139 – Got error ‘invalid character class’ from regexp – It was my problem too…

    my solution: name RLIKE ‘(^| +)$search_word($| +)’

    (^| +) = ‘^’ – is the beginning or ‘ +’ one or more spaces

    ($| +) = ‘$’ – is the end or ‘ +’ one or more spaces

    It find whole worlds in the sentenses and it seems to work well…

  17. Problem is no0ne, what if the word is at the end of a sentence and not followed by a space? Like a comma or a period for example? Then it doesn’t match.

Comments are closed. If you have something you really want to say, email editors@gadgetopia.com and we‘ll get it added for you.