While working on the Benchmarks search I wanted to provide a feature I find useful on Google and other search engines: word form expansion (lemmatisation). A little research showed me that this would require more work than we really should be spending on search functionality, especially considering that the built-in MySQL full text search capability is sufficient for our needs. So I decided to focus on a feature that would still provide value but require little time: word stem expansion.
Purpose and Need
As I mentioned, we’re using the full text search capability of MySQL as the core of our search functionality. By default, MySQL performs a natural language full text search. This mode has a few restrictions that can affect the results that are returned, but the most notable for our purposes is that a word provided by the user must exactly match a word in the record. This means that records containing only a variation of the word (such as a plural when a singular was provided) will not be returned. For example, if you search for “atom,” MySQL will not return records that have “atoms” or “atomic” but not “atom.”
I’ve found the easiest way to address this limitation is by performing the search using the keywords as typed, but then adding on results that have matches to the stem of the keywords. By using these two full text search methods in conjunction we can get the best results for our user.
Finding the Stem
The hardest part of this whole process was determining a way to find the stem for a word. Initially I developed a simple function that stripped plurality from a word. It was simplistic, but provided a good sample of how effective stemming could be. A little more research led me to a PHP extension that uses an established algorithm to determine a word stem. The extension is available from the PECL PHP library and is called “stem.” Installation is simple. On *nix you first install the extension:
sudo pecl install stem
Next, update the php.ini file to enable the extension by adding the following line:
extension=stem.so
Finally, restart the web server for the new setting to take effect:
sudo apachectl graceful
On Windows you first download the dll from pecl4win.php.net (PECL installation actually involves compiling the extension binary, which is not quite as straightforward on Windows). Place the dll in your php/ext directory. Next, update the php.ini file to enable the extension by adding the following line:
extension=php_stem.dll
Finally, restart IIS for the settings to take effect.
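Before relying on the extension, it can help to confirm that PHP actually picked it up. The following is only a sketch, not the code we run: naiveStem() and chooseStemmer() are hypothetical helpers, and the fallback is a crude plural stripper in the spirit of my first attempt rather than a real stemming algorithm.

```php
<?php
// Hypothetical fallback: strips a few common English plural endings.
// This is NOT an established stemming algorithm; it only illustrates the idea.
function naiveStem(string $word): string
{
    $word = strtolower($word);
    if (strlen($word) > 4 && substr($word, -3) === 'ies') {
        return substr($word, 0, -3) . 'y';   // "queries" -> "query"
    }
    if (strlen($word) > 3 && substr($word, -1) === 's' && substr($word, -2) !== 'ss') {
        return substr($word, 0, -1);         // "atoms" -> "atom"
    }
    return $word;                            // "class" and "atomic" are left alone
}

// Returns the name of the stemming function to use: the extension's
// stem() when it is available, otherwise the fallback above.
function chooseStemmer(): string
{
    return function_exists('stem') ? 'stem' : 'naiveStem';
}
```

If chooseStemmer() returns 'naiveStem' after you have restarted the web server, the extension line in php.ini is probably not being read.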
How to Use
Performing a search with stem expansion using MySQL’s full text search functionality is actually quite easy. First we need to break the search terms up so that we can stem them. To do this we need to isolate each word in the search phrase. We can do this by splitting the search phrase into an array. It’s impossible to know exactly what the user will type, so we can be greedy and use something like:
$araKeys = preg_split('/\W+/', $strSearch, -1, PREG_SPLIT_NO_EMPTY);
Then we need to modify each keyword so that we have its stem:
$araKeys = array_map('stem', $araKeys );
Recombine the keywords into a new boolean search phrase:
$strSearchBool = join('* ', $araKeys) . '*';
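Taken together, the three steps so far could be wrapped up roughly like this. It is a sketch under a few assumptions: buildBooleanPhrase() is a name I made up for illustration, the stemmer is passed in as a callable so the snippet runs without the extension, and $dropS is a throwaway stand-in, not a real stemmer.

```php
<?php
// Turns a raw search phrase into a boolean-mode phrase of stemmed prefixes.
// $stemmer is any callable that maps a word to its stem; with the PECL
// extension installed you would pass 'stem'.
function buildBooleanPhrase(string $search, callable $stemmer): string
{
    // Split on runs of non-word characters and drop empty pieces.
    $words = preg_split('/\W+/', $search, -1, PREG_SPLIT_NO_EMPTY);
    if (!$words) {
        return '';  // nothing searchable was typed
    }
    // Stem each word and drop duplicates so a stem is not repeated.
    $stems = array_unique(array_map($stemmer, $words));
    // A trailing * makes each stem a prefix match in boolean mode.
    return implode('* ', $stems) . '*';
}

// Stand-in stemmer for demonstration only (hypothetical).
$dropS = function (string $w): string {
    return rtrim(strtolower($w), 's');
};

echo buildBooleanPhrase('Atoms, molecules!', $dropS), "\n"; // atom* molecule*
```

Passing the stemmer in also makes it easy to swap the algorithm later without touching the rest of the search code.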
Finally, we combine the original natural language search with a boolean mode search:
MATCH(fulltext_index_columns) AGAINST ('$strSearch') OR MATCH(fulltext_index_columns) AGAINST ('$strSearchBool' IN BOOLEAN MODE)
Order the results by the sum of the relevance values for the full text searches and you’re done:
ORDER BY (MATCH(fulltext_index_columns) AGAINST ('$strSearch')) + (MATCH(fulltext_index_columns) AGAINST ('$strSearchBool' IN BOOLEAN MODE)) DESC
The two combined provide a good sense of how well a result matches. Records that match the entered search terms exactly will bubble to the top of the list since they match on both the natural language and boolean searches. Records that match only on the stem-based boolean search will come next, and the more words they match the higher in this secondary list they will be. These stem-based matches would not be returned by the natural language search since the exact search terms were not present in the records.
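The fragments above interpolate the search strings directly into the SQL, which is fine for illustration but risky with real user input. Here is one way the full query might be assembled with placeholders instead. Treat it as a sketch: buildSearchQuery() is a hypothetical helper, and the table name benchmarks and the columns title, description are made-up stand-ins for whatever your full text index covers.

```php
<?php
// Builds the combined query with positional placeholders instead of
// interpolating user input into the SQL string.
// $table and $columns are hypothetical names; they are inserted verbatim,
// so they must come from trusted code, never from user input.
function buildSearchQuery(string $table, string $columns, string $search, string $searchBool): array
{
    $natural = "MATCH($columns) AGAINST (?)";
    $boolean = "MATCH($columns) AGAINST (? IN BOOLEAN MODE)";
    $sql = "SELECT *, ($natural) + ($boolean) AS relevance"
         . " FROM $table"
         . " WHERE $natural OR $boolean"
         . " ORDER BY relevance DESC";
    // One value per placeholder, in the order they appear in $sql.
    return [$sql, [$search, $searchBool, $search, $searchBool]];
}

[$sql, $params] = buildSearchQuery('benchmarks', 'title, description', 'atoms', 'atom*');
// With a PDO connection in $pdo (assumed, not shown), this would run as:
// $stmt = $pdo->prepare($sql);
// $stmt->execute($params);
```

The relevance alias keeps the ORDER BY readable; the expression it stands for is the same sum of the two MATCH scores described above.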
MySQL has stated a desire to enhance full text searching with stem and proximity information. The implementation of these features will negate the need for this to be done via the hack described above. At that point this extra code should be removed, though I doubt it will cause problems with the search results if left in place.