Lucene custom scoring (Lucene 3.2) involves iterating through all documents in the index - fastest way?

Question

I'm trying to implement a custom scoring formula in Lucene that has nothing to do with tf-idf (so changing just the similarity, for example, will not work).

To do this, I need to take my custom Query and generate a score for every document stored in the index, not just the ones that match the terms in the query. My scoring checks what are essentially synonyms, so a doc can produce a positive score even if it doesn't contain the exact Terms. Is the best way simply to create an IndexReader, call Document d = reader.doc(i) for all docs (as described here), and generate a score on the spot?

I've been looking around at Lucene's scoring internals, specifically the various Scorer and Collector classes. It appears that, in Lucene 3.2, a Weight provides a Scorer, which together with the Collector loops through all documents that match the query. Can I use this structure in some way, but with a custom Scorer implementation that considers ALL documents?

I'm very curious as to what kind of scoring you're trying to implement. — Fred Foo

Marko Topolnik · Accepted Answer · 2011-12-17T20:18:41

If you decide to go for a custom scoring scheme, the proper way is to subclass CustomScoreQuery and override getCustomScoreProvider so that it returns your own subclass of CustomScoreProvider. The CustomScoreQuery constructor requires a subquery. Here you will want to provide a fast native Lucene Query that narrows down the result set as much as possible before your custom score calculation runs. You can also store any number of float values with each of your docs and make them accessible to your custom score provider. To do that, pass an appropriate ValueSourceQuery to the constructor of CustomScoreQuery for each such float value. See the Javadocs on these classes; they are well written. Unfortunately I don't have a Java snippet at hand.
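In the absence of a tested snippet, the following is only a minimal sketch of the idea, not a definitive implementation. The compiled part is plain Java with no Lucene dependency: SynonymScorer, addSynonyms, score, and the 1.0 / 0.5 weights are all hypothetical names and values invented for illustration. The comment block at the end outlines how such a scorer might be wired into CustomScoreQuery and CustomScoreProvider in Lucene 3.x; it is left as comments because it needs lucene-core on the classpath, and the helper termsOf is an assumption, not a Lucene method.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical synonym-aware scorer; none of these names come from Lucene. */
final class SynonymScorer {
    // Maps a query term to the alternative terms that count as synonyms.
    private final Map<String, Set<String>> synonyms = new HashMap<>();

    void addSynonyms(String term, String... alternatives) {
        synonyms.computeIfAbsent(term, k -> new HashSet<>())
                .addAll(Arrays.asList(alternatives));
    }

    /**
     * Scores one document against the query terms.
     * An exact match adds 1.0 and a synonym match adds 0.5
     * (arbitrary weights chosen for this sketch).
     */
    float score(List<String> queryTerms, Set<String> docTerms) {
        float total = 0f;
        for (String q : queryTerms) {
            if (docTerms.contains(q)) {
                total += 1.0f;
                continue;
            }
            for (String s : synonyms.getOrDefault(q, Collections.<String>emptySet())) {
                if (docTerms.contains(s)) {
                    total += 0.5f;
                    break; // count at most one synonym hit per query term
                }
            }
        }
        return total;
    }
}

// Outline of the Lucene 3.x wiring (assumption-laden, not compiled here):
//
// class SynonymScoreQuery extends CustomScoreQuery {
//     SynonymScoreQuery(Query subQuery) { super(subQuery); }
//
//     @Override
//     protected CustomScoreProvider getCustomScoreProvider(final IndexReader reader) {
//         return new CustomScoreProvider(reader) {
//             @Override
//             public float customScore(int doc, float subQueryScore, float valSrcScore)
//                     throws IOException {
//                 // termsOf(...) is a hypothetical helper that loads the doc's terms.
//                 return scorer.score(queryTerms, termsOf(reader, doc));
//             }
//         };
//     }
// }
```

If the result set genuinely cannot be narrowed, MatchAllDocsQuery is a native Lucene query that matches every document and could serve as the subquery, at the cost of running the custom score calculation over the whole index.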

