3
votes

I'm trying to implement a custom scoring formula in Lucene that has nothing to do with tf-idf (so changing just the similarity, for example, will not work).

In order to do this, I need to be able to take my custom Query and generate a score for every document stored in the index - not just the ones that match the terms in the query (since my scoring involves checking what are essentially synonyms, so even if a doc doesn't have the exact Terms, it could still produce a positive score). Is the best way to simply create an IndexReader and call Document d = reader.doc(i) for all docs (as described here), and then generate a score on the spot?

I've been looking around at Lucene's scoring internals, specifically various Scorer and Collector classes, and it appears that what happens (for Lucene 3.2) is a Weight provides a Scorer, which along with the Collector loops through all documents that match the query. Can I utilize this structure in some way, but again get a custom Scorer implementation to consider ALL documents?

3
I'm very curious as to what kind of scoring you're trying to implement.Fred Foo

3 Answers

2
votes

If you decide to go for a custom scoring scheme, the proper way is to use a subclass of CustomScoreQuery with getCustomScoreProvider overridden to return your subclass of CustomScoreProvider. The CustomScoreQuery constructor requires a subquery. Here you will want to provide a fast native Lucene Query that will narrow down the result set as much as possible before going through your custom score calculation. You can also arrange to store any number of float values with each of your docs and make those accessible to your custom score provider. You will need to provide an appropriate ValueSourceQuery to the constructor of CustomScoreQuery for each such float value. See the Javadocs on these classes, they are well written. Unfortunately I don't have a Java snippet at hand.

0
votes

As I understand Lucene, it stores (term, doc) pairs in its index, so that querying is implemented as

  1. Get documents containing the query terms,
  2. score/sort them.

I've never implemented my own scoring, but I'd look at IndexReader.termDocs first; it seems to implement step 1.

0
votes

With IndexReader.termDocs you can iterate through a term's posting list, that is, all documents that contain that term. You could use this to provide your whole own query processing own top of Lucene, but then you won't be able to use any of Query, Similarity and stuff.

Also, if you are working with synonyms Lucene has some things in the contrib package. Another possible solution, don't know if you tried it, is to inject synonyms into the documents through a Analyzer (or other). That way you could return documents even if they don't have query terms.