19
votes

Can anyone explain in simple words how a full text server like Sphinx works? In plain SQL, one would use SQL queries like this to search for certain keywords in texts:

select * from items where name like '%keyword%';

But in the configuration files generated by various Sphinx plugins I can not see any queries like this at all. They contain instead SQL statements like the following, which seem to divide the search into distinct ID groups:

SELECT (items.id * 5 + 1) AS id, ... 
       WHERE items.id >= $start AND items.id <= $end 
       GROUP BY items.id
..
SELECT * FROM items WHERE items.id = (($id - 1) / 5)

It it possible to explain in simple words how these queries work and how they are generated?

4

4 Answers

18
votes

Inverted Index is the answer to your question: http://en.wikipedia.org/wiki/Inverted_index

Now when you run a sql query through sphinx, it fetches the data from the database and constructs the inverted index which in Sphinx is like a hashtable where the key is a 32 bit integer which is calculated using crc32(word) and the value is the list of documentID's having that word.

This makes it super fast.

Now you can argue that even a database can create a similar structure for making the searches superfast. However the biggest difference is that a Sphinx/Lucene/Solr index is like a single-table database without any support for relational queries (JOINs) [From MySQL Performance Blog]. Remember that an index is usually only there to support search and not to be the primary source of the data. So your database may be in "third normal form" but the index will be completely be de-normalized and contain mostly just the data needed to be searched.

Another possible reason is generally databases suffer from internal fragmentation, they need to perform too much semi-random I/O tasks on huge requests.

What that means is, for example, considering the index architecture of a databases, the query leads to the indexes which in turn lead to the data. If the data to recover is widely spread, the result will take long and that seems to be what happens in databases.

EDIT: Also please see the source code in cpp files like searchd.cpp etc for the real internal implementation, I think you are just seeing the PHP wrappers.

4
votes

Those queries you are looking at, are the query sphinx uses, to extract a copy of the data from the database, to put in its own index.

Sphinx needs a copy of the data to build it index (other answers have mentioned how that index works). You then ask for results (matching a specific query) from the searchd daemon - it consults the index and returns you matching documents.

The particular example you have choosen looks quite complicated, because it only extracting a part of the data, probbably for sharding - to split the index into parts for performance reasons. And is using range queries - so can access big datasets piecemeal.

An index could be built with a much simpler query, like

sql_query = select id,name,description from items

which would create a sphinx index, with two fields - name and description that could be searched/queried.

When searching, you would get back the unique id. http://sphinxsearch.com/info/faq/#row-storage

1
votes

Full text search usually use one implementation of inverted index. In simple words, it brakes the content of a indexed field in tokens (words) and save a reference to that row, indexed by each token. For example, a field with The yellow dog for row #1 and The brown fox for row #2, will populate an index like:

brown  -> row#2
dog    -> row#1
fox    -> row#2
The    -> row#1
The    -> row#2
yellow -> row#1
0
votes

A short answer to the question is that databases such as MySQL are specifically designed for storing and indexing records and supporting SQL clauses (SELECT, PROJECT, JOIN, etc). Even though they can be used to do keyword search queries, they cannot give the best performance and features. Search engines such as Sphinx are designed specifically for keyword search queries, thus can provide much better support.