0
votes

Experience problem with indexing lot's of content data. Searching for the suitable solution.

The logic if following:

  • Robot is uploading content every day to the database.

  • Sphinx index must reindex only new (daily) data. I.e. the previous content is never being changed.

Sphinx delta indexing is an exact solution for this, but with too much content the error is rising: too many string attributes (current index format allows up to 4 GB).

Distributed indexing seems to be usable, but how to dynamically (without dirty hacks) add & split indexing data?

  • I.e.: day 1 there are total 10000 rows, day 2 - 20000 rows and etc. The index throws >4GB error on about 60000 rows.

  • The expected index flow: 1-5 day there is 1 index (no matter distributed or not), 6-10 day - 1 distributed (composite) index (50000 + 50000 rows) and so on.

  • The question is how to fill distributed index dynamically?

     Daily iteration sample:
         main index
            chunk1  - 50000 rows
            chunk2  - 50000 rows
            chunk3  - 35000 rows           
    
         delta index
            10000 new rows
    
         rotate "delta"
    
         merge "delta" into "main"
    

Please, advice.

1
have you looked at real time indexes? They seem a perfect fit for your usecase - barryhunter
Exact. The RT-indexes is what I'am looking for. Thanks! - Dmitry

1 Answers

1
votes

Thanks to #barryhunter

RT indexes is a solution here.

Good manual is here: https://www.sphinxconnector.net/Tutorial/IntroductionToRealTimeIndexes

I've tested match queries on 3 000 000 000 letters. The speed is close to be the same as for "plain" index type. Total index size on HDD is about 2 GB.

Populating sphinx rt index: CPU usage: ~ 50% of 1 core / 8 cores, RAM usage: ~ 0.5% / 32 GB, Speed: quick as usual select - insert (mostly depends on using batch insert or row-by-row)

NOTE: "SELECT MAX(id) FROM sphinx_index_name" will produce error "fullscan requires extern docinfo". Setting docinfo = extern will not solve this. So keep counter simply in mysql table (like for sphinx delta index: http://sphinxsearch.com/docs/current.html#delta-updates).