
I'm very confused by some Lucene.NET behavior I'm observing. I assume the same is true of Java's Lucene, but I haven't verified that. Here's a test to demonstrate:

[Fact]
public void repro()
{
    var directory = new RAMDirectory();
    var analyzer = new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_30);

    float firstScore, secondScore, thirdScore;

    using (var indexWriter = new IndexWriter(directory, analyzer, IndexWriter.MaxFieldLength.UNLIMITED))
    {
        var document = new Document();
        document.Add(new Field("id", "abc", Field.Store.YES, Field.Index.NOT_ANALYZED));
        document.Add(new Field("field", "some text in the field", Field.Store.NO, Field.Index.ANALYZED));
        indexWriter.UpdateDocument(new Term("id", "abc"), document, analyzer);

        // the more times I call UpdateDocument here, the higher the score is for the subsequent hit
        // indexWriter.UpdateDocument(new Term("id", "abc"), document, analyzer);
        indexWriter.Commit();

        var queryParser = new QueryParser(Lucene.Net.Util.Version.LUCENE_30, "field", analyzer);
        var parsedQuery = queryParser.Parse("some text in the field");

        using (var indexSearcher = new IndexSearcher(directory, readOnly: true))
        {
            var hits = indexSearcher.Search(parsedQuery, 10);
            Assert.Equal(1, hits.TotalHits);
            firstScore = hits.ScoreDocs[0].Score;
        }

        using (var indexSearcher = new IndexSearcher(directory, readOnly: true))
        {
            var hits = indexSearcher.Search(parsedQuery, 10);
            Assert.Equal(1, hits.TotalHits);
            secondScore = hits.ScoreDocs[0].Score;
        }

        document = new Document();
        document.Add(new Field("id", "abc", Field.Store.YES, Field.Index.NOT_ANALYZED));
        document.Add(new Field("field", "some changed text in the field", Field.Store.NO, Field.Index.ANALYZED));

        // if I call DeleteAll here, then score three is the same as score one and two (which is probably fine, though not quite what I expected either)
        // indexWriter.DeleteAll();

        indexWriter.UpdateDocument(new Term("id", "abc"), document, analyzer);
        indexWriter.Commit();

        using (var indexSearcher = new IndexSearcher(directory, readOnly: true))
        {
            var hits = indexSearcher.Search(parsedQuery, 10);
            Assert.Equal(1, hits.TotalHits);
            thirdScore = hits.ScoreDocs[0].Score;
        }
    }

    // this is fine
    Assert.Equal(firstScore, secondScore);

    // this is not
    Assert.True(thirdScore < secondScore);
}

The steps are:

  1. Add a document to the index with "some text in the field" as its indexed text.
  2. Search for "some text in the field" twice, recording the scores as firstScore and secondScore
  3. Update the document so that the indexed text is now "some changed text in the field"
  4. Search for "some text in the field" again, recording the score as thirdScore
  5. Assert that the first and second scores are equal, and that the third score is less than the first and second

The really weird thing is that thirdScore is greater than firstScore and secondScore. Here's what I've found:

  • the more times I call UpdateDocument on the index with the same document, the higher the score will become
  • completely deleting the index before performing the third search yields a score equal to the first and second scores. I was expecting a little bit less because of the extra word in the indexed text ("changed"), but even having the scores equal would suffice
  • forgoing UpdateDocument and instead manually deleting and re-adding the document makes no difference
  • calling WaitForMerges on the IndexWriter after committing makes no difference

Can anyone explain this behavior to me? Why would the scores change over subsequent updates to the document when neither the document content nor the query is changing?


1 Answer


First, the most useful tool to know about when trying to understand why something is being scored a certain way: IndexSearcher.Explain

Explanation explain = indexSearcher.Explain(parsedQuery, hits.ScoreDocs[0].Doc);
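
Explanation overrides ToString with the full factor-by-factor tree, so the quickest way to read it from a test like yours is just to print it:

Console.WriteLine(explain.ToString());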

This gives us a detailed explanation of how the score was arrived at. In this case, the explanations for the three searches look very similar, except that the idf score for the third search looks like this:

0.5945349 = idf(docFreq=2, maxDocs=2)

Compared to this, from the first two searches:

0.3068528 = idf(docFreq=1, maxDocs=1)
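
Those two numbers line up exactly with DefaultSimilarity's idf formula in Lucene 3.x, idf = 1 + ln(maxDocs / (docFreq + 1)), which you can check by hand:

// DefaultSimilarity idf: 1 + ln(maxDocs / (docFreq + 1))
Console.WriteLine(1 + Math.Log(1.0 / (1 + 1))); // 0.3068528... (docFreq=1, maxDocs=1)
Console.WriteLine(1 + Math.Log(2.0 / (2 + 1))); // 0.5945349... (docFreq=2, maxDocs=2)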

A Lucene update is just a deletion followed by an insertion. Deletions generally just flag a document as deleted, and the data is only actually purged from the index later. So you won't see deleted documents in search results, but they still influence statistics like docFreq. The impact is usually pretty minimal when you have a lot of data.
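
You can observe the flagged-but-unpurged document directly. A minimal probe, assuming the Lucene.NET 3.0.3 API where MaxDoc is a property and NumDocs() is a method:

using (var reader = IndexReader.Open(directory, readOnly: true))
{
    // After the second UpdateDocument + Commit in the test above, and before
    // any merge has run, the old version of the document is still in the segment:
    Console.WriteLine(reader.NumDocs()); // live documents only (1)
    Console.WriteLine(reader.MaxDoc);    // includes flagged deletes (typically 2 here)
}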

You can force the index to ExpungeDeletes to see this:

indexWriter.UpdateDocument(new Term("id", "abc"), document, analyzer);
indexWriter.Commit();

// argument = true to block until the purge has completed.
indexWriter.ExpungeDeletes(true);
indexWriter.Commit();

And then you should see them all get the same score.

Bear in mind that expunging deletes can be quite an expensive operation; in practice, you probably should not be doing it after every update.


As to why you get the same score for the document with "some text in the field" as for the one with "some changed text in the field": that comes down to the lengthNorm factor. The lengthNorm is calculated at index time and stored in the field's norm, and norms are compressed in a very lossy fashion, down to a single byte, for performance. All told, they carry three bits of precision, not even one full significant decimal digit. So there just isn't enough difference between those two field lengths to be represented in the score. Try it with something like:

some more significantly changed text in the field

And you should see the lengthNorm take effect.
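
If you want to see how coarse that single byte is, you can round-trip values through the norm codec. A sketch, assuming the static Similarity.EncodeNorm/DecodeNorm pair from Lucene.NET 3.0.3:

// DefaultSimilarity's lengthNorm for a field of n terms is 1/sqrt(n);
// encoding it to a byte and back shows how much precision survives.
for (int n = 3; n <= 30; n++)
{
    float raw = (float)(1.0 / Math.Sqrt(n));
    float stored = Similarity.DecodeNorm(Similarity.EncodeNorm(raw));
    Console.WriteLine("{0,2} terms: raw={1:F4} stored={2:F4}", n, raw, stored);
}

As the field gets longer, several adjacent term counts collapse onto the same stored byte, at which point adding or removing one word cannot move the score at all.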