How to handle very frequent updates to a Lucene index

Question

I am trying to prototype an indexing/search application which uses very volatile indexing data sources (forums, social networks, etc.). Here are some of the performance requirements:

1. Very fast turn-around time: any new data (such as a new message on a forum) should be available in the search results very soon (in less than a minute).
2. I need to discard old documents on a fairly regular basis to ensure that the search results are not dated.
3. Last but not least, the search application needs to be responsive (latency on the order of 100 milliseconds, and it should support at least 10 qps).

All of the requirements I have currently can be met without using Lucene (and that would let me satisfy all of 1, 2 and 3), but I am anticipating other requirements in the future (like search relevance etc.) which Lucene makes easier to implement. However, since Lucene is designed for use cases far more complex than the one I'm currently working on, I'm having a hard time satisfying my performance requirements.

Here are some questions:

a. I read that the optimize() method in the IndexWriter class is expensive and should not be used by applications that do frequent updates. What are the alternatives?

b. In order to do incremental updates, I need to keep committing new data, and also keep refreshing the index reader to make sure it has the new data available. These are going to affect 1 and 3 above. Should I try duplicate indices? What are some common approaches to solving this problem?

c. I know that Lucene provides a delete method which lets you delete all documents that match a certain query. In my case, I need to delete all documents which are older than a certain age. One option is to add a date field to every document and use that to delete documents later. Is it possible to do range queries on document ids to delete documents (I can create my own id field, since I think that the one created by Lucene keeps changing)? Is it any faster than comparing dates represented as strings?

I know these are very open questions, so I am not looking for a detailed answer. I will try to treat all of your answers as suggestions and use them to inform my design. Thanks! Please let me know if you need any other information.

Shashikant Kore · Accepted Answer · 2010-10-01T07:03:06

Lucene now supports Near Real Time (NRT) search. Essentially, you get a Reader from the IndexWriter every time you are doing a search. The in-memory changes do not go to disk until the RAM buffer size is reached or an explicit commit is called on the writer. As disk IO is avoided by skipping the commit, the searches return quickly even with the new data.

One of the troubles with Lucene's NRT is the logarithmic merging algorithm used for index segments. A merge is triggered after 10 documents are added to a segment. Next, 10 such segments are merged to create a segment with 100 documents, and so on. Now, if you have 999,999 documents and a merge is triggered, it will take quite some time to return, breaking your "real-time" promise.
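To make the arithmetic above concrete, here is a rough sketch of the cascade. It is not Lucene's actual merge policy code: the class name LogMergeSketch is made up for illustration, and it assumes a merge factor of 10, treats every added document as its own one-document segment, and counts "cost" simply as the number of documents rewritten by the merges that a single add triggers.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical toy model of logarithmic merging (NOT Lucene code).
class LogMergeSketch {
    private final int mergeFactor;
    // segmentsPerLevel.get(k) = number of segments holding mergeFactor^k documents
    private final List<Integer> segmentsPerLevel = new ArrayList<>();

    LogMergeSketch(int mergeFactor) {
        this.mergeFactor = mergeFactor;
    }

    // Adds one document and returns how many documents were rewritten
    // by the merges that this single add triggered.
    long addDocument() {
        long rewritten = 0;
        long segmentSize = 1;
        int level = 0;
        increment(level);
        // When a level fills up, its segments merge into one segment on the next level.
        while (segmentsPerLevel.get(level) == mergeFactor) {
            segmentsPerLevel.set(level, 0);
            segmentSize *= mergeFactor;   // size of the newly merged segment
            rewritten += segmentSize;     // every document in it is copied once
            level++;
            increment(level);
        }
        return rewritten;
    }

    private void increment(int level) {
        while (segmentsPerLevel.size() <= level) {
            segmentsPerLevel.add(0);
        }
        segmentsPerLevel.set(level, segmentsPerLevel.get(level) + 1);
    }

    public static void main(String[] args) {
        LogMergeSketch index = new LogMergeSketch(10);
        long last = 0;
        for (int i = 0; i < 999_999; i++) {
            last = index.addDocument();
        }
        System.out.println(last);                // prints 0 (add #999,999 merges nothing)
        System.out.println(index.addDocument()); // prints 1111110 (add #1,000,000 cascades)
    }
}
```

In this model most adds cost nothing or very little, but the occasional add that completes a large level pays for the whole cascade at once, which is the latency spike described above.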

LinkedIn has released Zoie, a library on top of Lucene that addresses this issue. It is live in production, handling millions of updates and searches every day.

Most likely, Lucene will support all your requirements, as you are discarding old updates and the moving window is roughly of constant size. In case it doesn't, you may have to try Zoie, which is battle-tested.
