9
votes

I am trying to prototype an indexing/search application that uses very volatile data sources (forums, social networks, etc.). Here are some of the performance requirements:

  1. Very fast turnaround time: any new data (such as a new message on a forum) should be available in the search results very soon (in less than a minute).

  2. I need to discard old documents on a fairly regular basis to ensure that the search results are not dated.

  3. Last but not least, the search application needs to be responsive (latency on the order of 100 milliseconds, and it should support at least 10 qps).

All of the requirements I currently have can be met without using Lucene (which would let me satisfy all of 1, 2 and 3), but I am anticipating other requirements in the future (such as search relevance) that Lucene makes easier to implement. However, since Lucene is designed for use cases far more complex than the one I'm currently working on, I'm having a hard time satisfying my performance requirements.

Here are some questions:

a. I read that the optimize() method in the IndexWriter class is expensive and should not be used by applications that do frequent updates. What are the alternatives?

b. In order to do incremental updates, I need to keep committing new data, and also keep refreshing the index reader to make sure it has the new data available. These are going to affect 1 and 3 above. Should I try duplicate indices? What are some common approaches to solving this problem?

c. I know that Lucene provides a delete method, which lets you delete all documents that match a certain query. In my case, I need to delete all documents that are older than a certain age. One option is to add a date field to every document and use that to delete documents later. Is it possible to do range queries on document ids (I can create my own id field, since I think the one created by Lucene keeps changing) to delete documents? Is that any faster than comparing dates represented as strings?

I know these are very open questions, so I am not looking for a detailed answer. I will try to treat all of your answers as suggestions and use them to inform my design. Thanks! Please let me know if you need any other information.

4 Answers

6
votes

Lucene now supports Near Real Time (NRT) search. Essentially, you get a Reader from the IndexWriter every time you do a search. The in-memory changes do not go to disk until the RAM buffer size is reached or an explicit commit is called on the writer. As disk I/O is avoided by skipping the commit, searches return quickly even with the new data.

One of the troubles with Lucene's NRT is the logarithmic merge algorithm for index segments. A merge is triggered after 10 documents are added to a segment. Next, 10 such segments are merged to create a segment with 100 documents, and so on. Now, if you have 999,999 documents and a merge is triggered, it will take quite some time to return, breaking your "real-time" promise.
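To make the cost of that cascade concrete, here is a toy model of the merge behaviour described above. It is only a sketch and not Lucene's actual merge policy: the class name `LogMergeSketch`, the fixed merge factor of 10, and the idea of counting "documents rewritten" as the cost of a merge are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of a logarithmic merge scheme (NOT Lucene's real implementation).
// Each segment is represented only by its document count.
class LogMergeSketch {
    // Assumed merge factor, matching the "10" used in the answer above.
    static final int MERGE_FACTOR = 10;

    // Segment sizes in documents, oldest first.
    private final List<Integer> segments = new ArrayList<>();

    // Adds one document as a new single-document segment and returns how many
    // documents had to be rewritten by the merges this addition triggered.
    int addDocument() {
        segments.add(1);
        int rewritten = 0;
        // Keep merging while the newest MERGE_FACTOR segments have equal size.
        while (tailIsMergeable()) {
            int size = segments.get(segments.size() - 1);
            for (int i = 0; i < MERGE_FACTOR; i++) {
                segments.remove(segments.size() - 1); // remove by index
            }
            int merged = size * MERGE_FACTOR;
            segments.add(merged);
            rewritten += merged;
        }
        return rewritten;
    }

    // True when the last MERGE_FACTOR segments all have the same size.
    private boolean tailIsMergeable() {
        int n = segments.size();
        if (n < MERGE_FACTOR) {
            return false;
        }
        int size = segments.get(n - 1);
        for (int i = n - MERGE_FACTOR; i < n; i++) {
            if (segments.get(i) != size) {
                return false;
            }
        }
        return true;
    }

    int segmentCount() {
        return segments.size();
    }
}
```

In this model most additions rewrite nothing, but the addition that completes a power of 10 cascades through every level, so the occasional add is far more expensive than the average one. That spike is what threatens the "real-time" promise.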

LinkedIn has released Zoie, a library on top of Lucene that addresses this issue. It is live in production, handling millions of updates and searches every day.

For the most part, Lucene will support all your requirements, as you are discarding old updates and the moving window is of roughly constant size. In case it doesn't, you may have to try Zoie, which is battle-tested.

4
votes

You might want to consider using Solr rather than straight-up Lucene. Solr handles all of the requirements you mentioned (near-realtime updates, deleting documents, performance/sharding, range queries), and it'll do it better than your own hand-rolled code. You won't have to deal with issues at the IndexReader level, such as when to refresh the IndexReader after an update.

As far as range queries go, Solr has TrieField capabilities, which make numeric range queries super fast. See http://www.lucidimagination.com/blog/2009/05/13/exploring-lucene-and-solrs-trierange-capabilities/

0
votes

A: I think with the latest versions of Lucene, the optimize method is not really needed and with my suggestion for item C, it really shouldn't be needed.

B: Again, I think with the latest version of Lucene, the Searchers are aware when updates are done and can handle that without you needing to do anything special.

C: I'd avoid deleting and just create a new index daily. If you store the age of the document in the index, then you can use the existing index to create the new one. While writing the new index, fetch all of the young documents from the existing one, walk through them, and add them to your new index. Have a public util method called getCurrentIndex that is used by the searchers to grab the latest live index. Keep 1 or 2 old indexes around just in case and you should be good to go.
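A minimal sketch of the getCurrentIndex idea, assuming each index is identified by a plain name such as a directory path. Apart from `getCurrentIndex`, every name here (`IndexRotationSketch`, `publish`, `OLD_INDEXES_TO_KEEP`) is hypothetical, and building the new index from the young documents is left out.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.concurrent.atomic.AtomicReference;

// Sketch of swapping in a freshly built index while keeping a few old ones.
// An "index" is just a name here (for example a directory path); this is an
// assumption for illustration, not a Lucene API.
class IndexRotationSketch {
    // Assumed retention: the answer suggests keeping 1 or 2 old indexes.
    private static final int OLD_INDEXES_TO_KEEP = 2;

    private final AtomicReference<String> current = new AtomicReference<>();
    private final Deque<String> previous = new ArrayDeque<>(); // newest first

    // Called by searchers to find the latest live index (null before the
    // first publish).
    String getCurrentIndex() {
        return current.get();
    }

    // Called once a new index is fully built. Returns the name of an old
    // index that is now safe to delete, or null if nothing should be deleted.
    synchronized String publish(String newIndex) {
        String old = current.getAndSet(newIndex);
        if (old != null) {
            previous.addFirst(old);
        }
        return previous.size() > OLD_INDEXES_TO_KEEP ? previous.removeLast() : null;
    }
}
```

Searchers only ever read the reference, so they see either the old index or the new one, never a half-built one. A daily job would build the new index first and call `publish` as its last step.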

0
votes

You can cache your index searcher for a short period of time and then reopen it. For this purpose we use the ASP.NET WebCache, which has a CacheItemUpdateCallback that is called right before the cached item expires.
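The same idea can be sketched outside ASP.NET. The Java class below is a hypothetical stand-in, not the WebCache API: it reopens lazily on the first request after expiry rather than through a callback, and the `opener` and `clockMillis` parameters are assumptions introduced so that the example is self-contained.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

// Sketch of a searcher cache that reopens the searcher after a fixed time.
// S stands for whatever searcher type the application uses.
class ExpiringSearcherCache<S> {
    private final Supplier<S> opener;       // opens a fresh searcher (assumed)
    private final LongSupplier clockMillis; // time source, injectable for tests
    private final long ttlMillis;           // how long a searcher is reused

    private S cached;
    private long openedAt;

    ExpiringSearcherCache(Supplier<S> opener, long ttlMillis, LongSupplier clockMillis) {
        this.opener = opener;
        this.ttlMillis = ttlMillis;
        this.clockMillis = clockMillis;
    }

    // Returns the cached searcher, reopening it if it is missing or expired.
    // A real implementation would also close the searcher being replaced
    // once no in-flight query is still using it.
    synchronized S get() {
        long now = clockMillis.getAsLong();
        if (cached == null || now - openedAt >= ttlMillis) {
            cached = opener.get();
            openedAt = now;
        }
        return cached;
    }
}
```

In production the clock would typically be `System::currentTimeMillis`. The expiry time bounds how stale search results can be, so it should stay well under the one-minute freshness target from the question.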