2 votes

We've got 50,000,000 (and growing) documents which we want to be able to search.

Each "document" is in reality a page of a larger document, but the granularity required is at the page level.

Each document therefore has a few bits of metadata (e.g., which larger document it belongs to).

We originally built this using Sphinx which has served quite well, but is getting slow, despite having quite generous hardware thrown at it (via Amazon AWS).

There are new requirements coming through that mean we have to be able to pre-filter the database before searching, i.e. to search only a subset of the 50M documents based on some aspect of the metadata (e.g., "search only documents added in the last 6 months", or "search only the documents belonging to this arbitrary list of parent documents").

One significant requirement is that we group search results by parent document, e.g. returning only the first match from each parent document so that the first page of results shows the user a wider range of matching parent documents, rather than loads of matches from the first parent document followed by loads from the second, and so on. We would then give the user the option to search pages within only one specific parent document.

The solution doesn't have to be "free" and there is a bit of budget to spend.

The content is sensitive and needs to be protected so we can't simply let Google index it for us, at least not in any way that would allow the general public to come across it.

I've looked at using Sphinx with even more resources (putting an index of 50M documents into memory is sadly not an option within our budget) and I've looked at Amazon CloudSearch but it seems that we'd have to spend >$4k per month and that's beyond the budget.

Any suggestions? Something deployable within AWS is a bonus. I'm aware that we may be asking for the unobtainable but if you think that's the case, please say so (and give reasons!)

50M documents is not that big for Sphinx. How big is your index? Things could be optimized (the way the index is built and/or the way searches are made). – aditirex
The size of the index is currently 160GB. I would like to continue with Sphinx but it doesn't seem flexible enough to meet the requirements, and performance is unimpressive at the moment (I admit there might be optimisations I don't know about, though). – Coder
I imagine the main benefit will be sharding. Rather than one big monolithic index, split it into bits, say over 4 parts. Even on one VM (as long as it has multiple virtual cores) this could really help, but it could also be beneficial to use multiple VMs (each one being smaller will cost less!). sphinxsearch.com/docs/current.html#distributed – barryhunter

1 Answer

1 vote

50M docs sounds like quite a feasible task for Sphinx.

We originally built this using Sphinx which has served quite well, but is getting slow, despite having quite generous hardware thrown at it (via Amazon AWS).

I second the comment above suggesting sharding. Sphinx allows you to split a big index into several shards, each served by its own agent. You can run the agents on the same server or distribute them across multiple AWS instances.
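As a rough sketch only (the index names, IP addresses and ports below are made up, and the right layout depends on your data and hardware), a distributed index in sphinx.conf could look something like this:

    # sphinx.conf -- hypothetical sharded setup
    # docs_part0..docs_part3 are assumed shard names; addresses are placeholders
    index docs_dist
    {
        type  = distributed

        # shards served locally by this searchd instance
        local = docs_part0
        local = docs_part1

        # shards served by searchd agents running on other AWS instances
        agent = 10.0.1.10:9312:docs_part2
        agent = 10.0.1.11:9312:docs_part3
    }

Queries against docs_dist are fanned out to all shards in parallel and the results merged, so each individual shard stays small enough to be served quickly.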

There are new requirements coming through that mean we have to be able to pre-filter the database before searching, i.e. to search only a subset of the 50M documents based on some aspect of the metadata

Assuming these metadata fields are indexed as attributes, you can add SQL-like filters to every search query (e.g. doc_id IN (1,2,3,4) AND date_created > '2014-01-01').
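For illustration (the index and attribute names below are assumptions, not taken from your schema), a filtered SphinxQL query might look like this. Sphinx stores timestamp attributes as plain UNIX integers, so the date filter uses an epoch value (1388534400 = 2014-01-01 UTC) rather than a date string:

    SELECT id, parent_id
    FROM docs_dist
    WHERE MATCH('contract termination clause')
      AND date_added > 1388534400
      AND parent_id IN (101, 102, 103)
    LIMIT 0, 20;

The MATCH() clause does the full-text search, while the attribute conditions restrict it to whatever subset of the 50M documents you need, covering both the "last 6 months" and the "arbitrary list of parent documents" cases.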

One significant requirement is that we group search results by parent document

You can group by any attribute.
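Continuing the same hypothetical example, grouping by an assumed parent_id attribute and keeping only the best-ranked page per parent document would look roughly like this:

    SELECT id, parent_id, WEIGHT() AS w
    FROM docs_dist
    WHERE MATCH('contract termination clause')
    GROUP BY parent_id
    WITHIN GROUP ORDER BY w DESC
    ORDER BY w DESC
    LIMIT 0, 20;

That returns one row (the top-weighted page) per parent document for the first page of results; when the user then drills into a single parent document, you can drop the GROUP BY and add a parent_id = N filter instead.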