
I want to use Lucene with the following scoring logic: when I index my documents, I want to set a score/weight for each field. When I query my index, I want to set a score/weight for each query term.

I will NEVER index or query with multiple instances of the same field – in each query (or document) there will be at most one instance of a given field name. My fields/query terms are not analyzed – each is already a single token.

I want the score to be simply the dot product between the query's fields and the document's fields: for every field that appears in both with the same value, multiply the two weights, and sum the products.

For example:
Format is (Field Name) (Field Value) (Field Score)
Query:
1 AA 0.1
7 BB 0.2
8 CC 0.3

Document 1:
1 AA 0.2
2 DD 0.8
7 CC 0.999
10 FFF 0.1

Document 2:
7 BB 0.3
8 CC 0.5

The scores should be:
Score(q,d1) = FIELD_1_SCORE_Q * FIELD_1_SCORE_D1 = 0.1 * 0.2 = 0.02
Score(q,d2) = FIELD_7_SCORE_Q * FIELD_7_SCORE_D2 + FIELD_8_SCORE_Q * FIELD_8_SCORE_D2 = (0.2 * 0.3) + (0.3 * 0.5) = 0.21

What would be the best way to implement this, in terms of accuracy and performance? (I don't need TF or IDF calculations.)

I currently implement it by setting boosts on the fields and query terms. Then I overrode DefaultSimilarity and set it as the default before indexing/querying:

import org.apache.lucene.index.FieldInvertState;
import org.apache.lucene.search.DefaultSimilarity;

public class MySimilarity extends DefaultSimilarity {

    // Use the raw field boost as the norm instead of the length-based norm.
    @Override
    public float computeNorm(String field, FieldInvertState state) {
        return state.getBoost();
    }

    // Disable query normalization so query-term boosts pass through unchanged.
    @Override
    public float queryNorm(float sumOfSquaredWeights) {
        return 1;
    }

    // Ignore term frequency.
    @Override
    public float tf(float freq) {
        return 1;
    }

    // Ignore inverse document frequency.
    @Override
    public float idf(int docFreq, int numDocs) {
        return 1;
    }

    // Do not reward documents that match more query terms.
    @Override
    public float coord(int overlap, int maxOverlap) {
        return 1;
    }

}
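
For reference, a minimal sketch of how the similarity gets installed as the default in Lucene 3.x (the static default covers both indexing and searching; IndexSearcher#setSimilarity works per searcher):

import org.apache.lucene.search.Similarity;

// Install globally, before opening the IndexWriter and the IndexSearcher.
Similarity.setDefault(new MySimilarity());
// Or set it on a specific searcher instead:
// indexSearcher.setSimilarity(new MySimilarity());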


And based on http://lucene.apache.org/core/old_versioned_docs/versions/3_5_0/scoring.html this should work.
Problems:

  1. Performance: I am calculating all the TF/IDF and norm values for nothing…
  2. The score I get from the TopScoreDocCollector is not the same as the one I get from the Explanation.

Here is part of my code:

// Search and collect the top-N hits.
indexSearcher = new IndexSearcher(IndexReader.open(directory, true));
TopScoreDocCollector collector = TopScoreDocCollector.create(iTopN, true);
indexSearcher.search(query, collector);
ScoreDoc[] hits = collector.topDocs().scoreDocs;
for (int i = 0; i < hits.length; ++i) {
  int docId = hits[i].doc;
  Document d = indexSearcher.doc(docId);
  double score = hits[i].score;  // score from the collector
  String id = d.get(FIELD_ID);
  // Problem 2: this explanation reports a different score than hits[i].score.
  Explanation explanation = indexSearcher.explain(query, docId);
}

Thanks!


2 Answers


There are several things that you can fix:

  • you don't set your custom similarity in the snippet of code you pasted, see IndexSearcher#setSimilarity,

  • the tf method of your implementation of Similarity should return 0 when freq is equal to 0, as sketched below.
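
A minimal sketch of that tf fix, in the same Lucene 3.x DefaultSimilarity subclass as in the question:

@Override
public float tf(float freq) {
    // Absent terms (freq == 0) must contribute nothing to the score.
    return freq > 0 ? 1 : 0;
}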

Moreover, you should be careful with index-time boosts: because they are encoded on a single byte, there can be some precision loss; see In Lucene, why do my boosted and unboosted documents get the same score?.

One alternative to index-time boosts could be to index the boost values in a separate numeric field and then use a CustomScoreQuery and a float FieldCacheSource to leverage these boosts in the scores.
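
A minimal sketch of that alternative, assuming Lucene 3.x and hypothetical field names ("7" holding the term, "7_boost" holding its weight):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.function.CustomScoreQuery;
import org.apache.lucene.search.function.FloatFieldSource;
import org.apache.lucene.search.function.ValueSourceQuery;

// FloatFieldSource is a float FieldCacheSource: it reads the per-document
// weight from the field cache instead of from the lossy norm byte.
ValueSourceQuery weight = new ValueSourceQuery(new FloatFieldSource("7_boost"));
Query match = new TermQuery(new Term("7", "BB"));
// By default, CustomScoreQuery multiplies the subquery score by the value source.
Query scored = new CustomScoreQuery(match, weight);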


Figured out the answer - it's working great!

Inspired by another thread on the Lucene mailing list (Question about CustomScoreQuery), I am using the solution below, which works really well (with one drawback).

First, I discovered that some of my problems were caused by a wrong assumption: I did in fact have many fields/query terms with the same field ID. This ruined my approach, because the query boosts for those terms were aggregated and my calculations came out wrong.

What I did: during indexing, I appended the field value to the field ID (concatenated with '_') and used the desired score as the field value.

At search time I use a plain FieldScoreQuery (as-is, no modifications needed) with the composite field ID.

Here I can still use setBoost to set the query-side score, because now my fields are unique.
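
A minimal sketch of that search side in Lucene 3.x, using the composite field names and weights from the example query in the question (and assuming query normalization is disabled, as in MySimilarity above):

import org.apache.lucene.search.BooleanClause.Occur;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.function.FieldScoreQuery;

// One SHOULD clause per query term; the composite field name is "<id>_<value>".
BooleanQuery query = new BooleanQuery();
FieldScoreQuery f7 = new FieldScoreQuery("7_BB", FieldScoreQuery.Type.FLOAT);
f7.setBoost(0.2f); // query-side weight
query.add(f7, Occur.SHOULD);
FieldScoreQuery f8 = new FieldScoreQuery("8_CC", FieldScoreQuery.Type.FLOAT);
f8.setBoost(0.3f);
query.add(f8, Occur.SHOULD);
// Each matching clause scores boost * indexed weight, so the summed
// score over the clauses is the desired dot product.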

Logic-wise this is perfect: a dot product using Lucene.

Drawback: many, many distinct field names.

IMPORTANT:
Since I am not using the indexed fields' norms (the weight is the field's value itself), I am now indexing the fields using:

Field field = new Field(field_name, Float.toString(weight), Store.YES, Index.NOT_ANALYZED_NO_NORMS);

And the memory usage is back to normal...
So cool!