If you have a MySQL-based website with lots of records (like hundreds of thousands) and run a full-text search on them, you will probably run into performance problems. The first thing most people probably do is use LIKE with the % wildcard. This quickly becomes too slow, of course, because MySQL has to search through the entire text of every record of the table and can’t use any indexes.
The first solution most people think of is MySQL’s own full-text index capability. However, this doesn’t work very well if you run the query on many text fields per record. In my case, the search was even slower.
Sooner or later, you have to think about creating additional full-text indexes. At a seminar once I heard about how powerful the Apache Lucene project is, so that’s the first thing I looked into. However, I quickly found out that it’s very complicated and requires a larger learning curve than I was willing to invest. In the online documentation for Lucene somewhere it said that if Lucene is too low-level for you then to try Apache Solr, so I looked into that. Solr is based on Lucene and runs on Java, so that means you need to install Apache Tomcat if you haven’t already, which I didn’t. I looked into installing it on my XAMPP installation and realized I would need to either update XAMPP to a version that included it or add it manually. This was turning into a larger project than I wanted it to be even before I could begin to think about the search server itself. So I looked into something else.
Here are some lists of search engine software (search servers):
- http://en.wikipedia.org/wiki/Category:Free_search_engine_software
- http://onlamp.com/pub/a/php/2006/02/16/search-engine-showdown.html
Sphinx Search caught my eye because it was specifically designed to index MySQL databases, which is what I wanted primarily, although you can index just about anything else too, like XML files and web pages. It was also mentioned as particularly easy to implement.
This is true, although, of course, there are pitfalls, especially since, like most things web server, it is mainly aimed at Linux and not Windows installations. And this is the main reason for this tutorial, to help Windows users install and implement Sphinx Search with as little trouble as possible.
Resources that helped me:
- http://sphinxsearch.com/docs/2.0.1/
- http://sphinxsearch.com/forum/view.html?id=2972
- http://www.crankberryblog.com/2011/intalling-sphinx-on-wamp-localhost-windows
- http://sphinxsearch.com/forum/view.html?id=4986
Decide which release you want, generally the latest stable beta release as described here: http://sphinxsearch.com/downloads/
Download “Win32 binaries w/MySQL support” here: http://sphinxsearch.com/downloads/beta/
Unzip the zip file to C:\Sphinx. This directory should now include a few files, such as sphinx.conf.in, and subdirectories, such as api and bin.
Add the subdirectories data and log.
Copy the file C:\Sphinx\sphinx.conf.in to C:\Sphinx\sphinx.conf.
Open C:\Sphinx\sphinx.conf in a text editor.
Change the settings sql_user and sql_pass to match your MySQL installation.
Find and replace all occurrences of @CONFDIR@/ with C:\Sphinx\, at least in the uncommented lines.
Find and replace all occurrences of forward slash (/) with backslash (\) in these same lines. For example @CONFDIR@/data/test1 becomes C:\Sphinx\data\test1.
Leave the source and index settings as they are for now for testing purposes.
If MySQL isn’t running already, start it now.
Start cmd.exe as an administrator (click the Windows Start icon, type cmd in the search box without hitting Enter, right-click "cmd.exe", click “Run as administrator”). On Windows XP this is easier: no need to worry about administrator rights.
Go to your MySQL installation directory, for example enter cd c:\xampp\mysql.
Create the test database and fill it with data by entering:
mysql -u [youruser] -p < c:\sphinx\bin\example.sql (replace [youruser] with your user name).
Enter cd c:\sphinx\bin.
Index the test database by entering indexer.exe --config c:\sphinx\sphinx.conf test1.
Install Sphinx as a service by entering:
searchd.exe --install --config c:\sphinx\bin\sphinx.conf --servicename SphinxSearch
Start the service: Go to Control panel > Administrative tools > Services. Right-click on "SphinxSearch" in the list and click "Start".
Test the search by entering: search.exe --config c:\sphinx\sphinx.conf test. This should result in some matches for the search term “test”.
Copy the Sphinx Search PHP API from C:\Sphinx\api\sphinxapi.php to your site’s HTML directory, for example C:\xampp\htdocs\[yoursite]\sphinxapi.php.
Create a test PHP file with the following contents:
<?php
mysql_connect("localhost", "root", [yourpassword]);
mysql_select_db("test");
require_once('sphinxapi.php');
$s = new SphinxClient;
$s->setServer("127.0.0.1", 9312); // NOT "localhost" under Windows 7!
$s->setMatchMode(SPH_MATCH_EXTENDED2);
$s->SetLimits(0, 25);
$result = $s->Query("test");
if ($result['total'] > 0) {
echo 'Total: ' . $result['total'] . "<br>\n";
echo 'Total Found: ' . $result['total_found'] . "<br>\n";
echo '<table>';
echo '<tr><td>No.</td><td>ID</td><td>Group ID</td><td>Group ID 2</td><td>Date Added</td><td>Title</td><td>Content</td></tr>';
foreach ($result['matches'] as $id => $otherStuff) {
$row = mysql_fetch_array(mysql_query("select * from documents where id = $id"));
extract($row);
++ $no;
echo "<tr><td>$no</td><td>$id</td><td>$group_id</td><td>$group_id2</td><td>$date_added</td><td>$title</td><td>$content</td></tr>";
}
echo '</table>';
} else {
echo 'No results found';
}
?>
Run the PHP file in a browser. You should get a table of results that include the search term “test”.
Now it’s time to RTFM to implement this in your own application. You’ll need to adapt sphinx.conf to index your own MySQL database (http://sphinxsearch.com/docs/2.0.1/conf-reference.html) and adapt the API settings to suit your needs (http://sphinxsearch.com/docs/2.0.1/api-reference.html).
For example, in sphinx.conf you probably want to change the source src1 section to something meaningful. I use [database]_[table], for example source mydb_yourtable, since there can be more than one source section. I figure that avoids confusion, and if I have multiple databases/sources I can simply specify the list of indexes to search in the Query() call, for example:
$cl->Query ("test query", "db1_table1, db2_table3");
Change sql_db to your database.
Change sql_query_info to your table.
You also need to change sql_query to access the MySQL table you want to index. The first field must be a unique, unsigned, positive, integer ID. If your table doesn't have one, you need to add it to the table first. This is how you will later reference the rows found in PHP.
For example:
ALTER TABLE yourtable ADD id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY FIRST;
Adapt or delete the settings sql_attr_uint and sql_attr_timestamp.
If you changed the name of the source section then you need to change the name or comment out the section source src1throttled : src1.
You probably want to also change the index test1 section to something meaningful. I use [database]_[table] as above, for example index mydb_yourtable.
Change the source setting in the index section to the whatever you named the source above, e.g. mydb_yourtable.
You can also change the path directive to something more sensible than test1.
If you want wildcard support, set enable_star = 1. This means that star (*) can be used at the beginning and/or the end of the keyword. The star will match zero or more characters, similar to % in MySQL. A star in the middle of the keyword doesn’t seem to be supported (but see below). Nor is a question mark or underscore to match exactly one character. You also need to set min_infix_len = 1 for example.
If you changed the name of the index section above then you need to change the name in the sections index test1stemmed : test1 and index dist1 or comment them out.
Be aware that max_matches = 1000 or change this value in the searchd section.
Until and unless you decide to use real-time indexes, you may want to comment out all the lines in the index rt section or delete them.
In the searchd section I set preopen_indexes = 0 because otherwise I had problems with delta index updates (see below).
Index this database as we did above for the test source by entering:
indexer.exe --rotate --config c:\sphinx\sphinx.conf --all
The --rotate option means you don’t have to stop and restart the Windows service running the searchd daemon. However, if you change sphinx.conf, you should manually stop and start searchd. The --rotate option should be used only when reindexing. As of version 0.9.9-rc1 this is no longer absolutely necessary; a --rotate should be sufficient. However, it is usually best to avoid ambiguity.
If you’ve gotten this far and are using real data, then indexing will take a few minutes.
If you ever need to remove the service entirely, start cmd.exe (as an administrator) and enter sc delete SphinxSearch. Somewhere along the line the changes I made to sphinx.conf didn’t work till I did this and reinstalled the service.
Adapt your test PHP file from above to access the database and table you just indexed and load it in a browser.
Now that you’ve gotten Sphinx Search to run in your web application, you need to think about how and when to automatically reindex the data. As of this writing there is no way to automatically do a live index update, at least not with wildcard support. However, you can do an incremental update:
Set up sphinx.conf for delta index updates as described in the manual (adapting to your situation): http://sphinxsearch.com/docs/current.html#live-updates
Use the Windows Task Scheduler to index the main source once daily at slack times. I did this:
c:\sphinx\bin\indexer.exe --rotate --config c:\sphinx\sphinx.conf --all
Once daily at 4:00 am.
Use the Windows Task Scheduler to index the delta source every few minutes at peak times. I did this:
Once daily at 4:00 am.
Use the Windows Task Scheduler to index the delta source every few minutes at peak times. I did this:
c:\sphinx\bin\indexer.exe --rotate --config c:\sphinx\sphinx.conf mydb_yourtable_delta
Every 5 minutes from 6:00 am to 11:00 pm.
Only new records are indexed every 5 minutes. Modified records only get re-indexed once a day.
Every 5 minutes from 6:00 am to 11:00 pm.
Only new records are indexed every 5 minutes. Modified records only get re-indexed once a day.
You might want to put these commands in batch files.
Re-indexing from the Task Scheduler in Windows 7 didn’t work for me, but it did work in Windows XP. See: http://sphinxsearch.com/bugs/view.php?id=860
In PHP you then need to list both the main index and the delta index in the Query() function, for example:
$cl->Query ("test query", "db1_table1, db1_table1_delta");
Update: You can simulate for example %one%two% in MySQL with *one* << *two* in Sphinx Search. There's no way to simulate an underscore for exactly one character.
Update: *one* << *two* finds one two but not onetwo.
Re-indexing from the Task Scheduler in Windows 7 didn’t work for me, but it did work in Windows XP. See: http://sphinxsearch.com/bugs/view.php?id=860
In PHP you then need to list both the main index and the delta index in the Query() function, for example:
$cl->Query ("test query", "db1_table1, db1_table1_delta");
Update: You can simulate for example %one%two% in MySQL with *one* << *two* in Sphinx Search. There's no way to simulate an underscore for exactly one character.
Update: *one* << *two* finds one two but not onetwo.