0
votes

We have a big index, around 1 Billion of documents. Our application does not allow users to search everything. They have subscriptions and they should be able to search only in them. Our first iteration of the index used attributes, so a typical query looked like this (we are using PHP API):

$cl->SetFilter('category_id', $category_ids); // array with all user subscriptions
$result = $cl->Query($term,"documents");

This worked without issues, but was very slow. Then we saw this article. The analogy with un-indexed MySQL query was alarming and we decided to ditch the attribute based filter and try with a full text column. So now, our category_id is a full_text column. Indeed our initial tests showed that searching is a lot faster, but when we launched the index into production we ran into an issue. Some users have many subscriptions and we started to receive this error from Sphinx:

Error: index documents: query too complex, not enough stack (thread_stack_size=337K or higher required)

Our new queries look like this:

user_input @category_id c545|c547|c549|c556|c568|c574|c577|c685...

When there are too many categories the above error shows up. We thought it will be easy to fix, by just increasing thread_stack to higher value, but it turned out to be limited to 2MB and we still have queries exceeding that.

The question is what to do now? We were thinking about splitting the query into smaller queries, but then how will we aggregate the results with the correct limit (we are using $cl->SetLimits($page, $limit); for pagination)?

Any ideas will be welcome.

1

1 Answers

0
votes

You can do the 'pagination' in the application, this is sort of how sphinx does merging when quering distrubuted indexes.

$upper_limit = ($page_number*$page_size)+1;
$cl->setLimits(0,$upper_limit);
foreach ($indexes as $index) {
   $cl->addQuery(...);
}
$cl->RunQueries()
$all = array;
foreach ($results) {
   foreach (result->matches) {
      $all[$id] = $match['weight'];
   }
}
asort($all);
$results = array_slice($all,$page,$page_size)

(I know its not completely valid PHP, its just to show the basic procedure)

... yes its wasteful, but in practice most queries are the first few pages anyway, so doesnt matter all that much. Its 'deep' results will be particully slow.