I'm having a problem indexing a large amount of content with Sphinx and am looking for a suitable solution.
The logic is as follows:
A robot uploads new content to the database every day.
Sphinx should reindex only the new (daily) data, i.e. previously indexed content never changes.
Sphinx delta indexing is exactly the right fit for this, but with too much content indexing fails with the error: too many string attributes (current index format allows up to 4 GB).
Distributed indexing seems usable, but how can I dynamically (without dirty hacks) add and split the indexed data?
For example: on day 1 there are 10000 rows in total, on day 2 there are 20000 rows, and so on. The index throws the >4 GB error at about 60000 rows.
The expected index flow: on days 1-5 there is 1 index (distributed or not, it does not matter), on days 6-10 there is 1 distributed (composite) index (50000 + 50000 rows), and so on.
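To make the expected split concrete, here is a minimal Python sketch of how the rows could be divided into chunks. The function name `plan_chunks` and the 50000-row chunk size are my own assumptions for illustration (the size is chosen only to stay below the roughly 60000 rows where the error appears); nothing here is part of Sphinx itself.

```python
def plan_chunks(total_rows, chunk_size=50000):
    """Split total_rows into consecutive (first_row, last_row) ranges.

    Ranges are 1-based and inclusive. chunk_size is an assumed safe
    limit, not a value taken from Sphinx.
    """
    ranges = []
    start = 1
    while start <= total_rows:
        end = min(start + chunk_size - 1, total_rows)
        ranges.append((start, end))
        start = end + 1
    return ranges


if __name__ == "__main__":
    # Day 5: everything still fits into a single chunk.
    print(plan_chunks(50000))
    # Day 10: two full chunks.
    print(plan_chunks(100000))
```

Each range would correspond to one local index in the distributed index, so a new chunk only needs to be created when the last range fills up.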
The question is: how do I fill the distributed index dynamically?
Daily iteration sample:
main index
  chunk1 - 50000 rows
  chunk2 - 50000 rows
  chunk3 - 35000 rows
delta index
  10000 new rows
rotate "delta"
merge "delta" into "main"
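One approach I am considering is regenerating the distributed index definition whenever a new chunk is added. The sketch below is only an illustration under assumptions: the helper `distributed_index_conf` is hypothetical, the index names `main`, `chunkN` and `delta` come from the sample above, and the chunk indexes themselves are assumed to be defined elsewhere in sphinx.conf. It relies only on the `type = distributed` and `local = ...` index directives.

```python
def distributed_index_conf(name, chunk_count, delta="delta"):
    """Build a sphinx.conf block for a distributed index.

    The block lists chunk1..chunkN plus the delta index as local
    indexes. Index names are assumptions taken from the sample flow.
    """
    lines = ["index %s" % name, "{", "    type = distributed"]
    for i in range(1, chunk_count + 1):
        lines.append("    local = chunk%d" % i)
    lines.append("    local = %s" % delta)
    lines.append("}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Three chunks plus the delta index, as in the daily sample.
    print(distributed_index_conf("main", 3))
```

With this layout the delta would presumably be merged into the last, not yet full chunk rather than into the distributed index itself, e.g. with something like `indexer --merge chunk3 delta --rotate`, but I am not sure this is the intended way to do it.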
Please advise.