I ran a simple query on unit test data, and the results come back in the expected order. The query is:

+(ancestors:wood name:wood) +(ancestors:screw name:screw)

The data and scores are:

  • First document (score 0.9944593)
    • name : Wood Screws
    • ancestors : Screws and fasteners
  • Second document (score 0.7294933)
    • name : Wood Plugs
    • ancestors : Screws and fasteners
    • ancestors : Screw Plugs
  • Third document (score 0.49740157)
    • name : Wood screws
    • ancestors : Other products

If I run the same query on production data (~3000 documents), I still get only the same three results, but the scores put them in a different order.

  • First document (score 3.9986732)
    • name : Wood screws
    • ancestors : Other products
  • Second document (score 3.9986732)
    • name : Wood Screws
    • ancestors : Screws and fasteners
  • Third document (score 3.7507305)
    • name : Wood Plugs
    • ancestors : Screws and fasteners
    • ancestors : Screw Plugs

The second order seems wrong. Intuitively, I would have expected the test order to be preserved, since documents 2 and 3 in the production results each match three words and document 1 matches only two.

The fact that the first two documents have identical scores is also strange. I have also tested 5 other similarity methods, and they all give equal scores for the first two documents.

I'm using Lucene 8.5.2 with BM25Similarity and default parameters.

Why does the relative score of Lucene documents change compared with the unit test when the same documents are found? How can I improve this scoring issue?

1 Answer

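This is because of the BM25 scoring algorithm. In its standard form, the scoring function is calculated as follows:

score(D, Q) = sum over each term q in Q of IDF(q) * f(q, D) * (k1 + 1) / (f(q, D) + k1 * (1 - b + b * |D| / avgdl))

where f(q, D) is the frequency of term q in document D, |D| is the length of D, avgdl is the average document length in the collection, and k1 and b are tuning parameters.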
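To make the role of each factor concrete, here is a minimal sketch of the textbook BM25 contribution of a single query term to a single document. It is not Lucene's code: the function name `bm25_term_score` and its parameter names are made up for illustration, and Lucene's own implementation may differ in details. The IDF variant and the defaults k1 = 1.2 and b = 0.75 are the ones commonly used with BM25.

```python
import math


def bm25_term_score(tf, doc_len, avg_doc_len, n_docs, doc_freq, k1=1.2, b=0.75):
    """Textbook BM25 score of one query term for one document (illustrative only).

    tf          -- how often the term occurs in the document
    doc_len     -- length of the document (in terms)
    avg_doc_len -- average document length in the collection
    n_docs      -- number of documents in the collection
    doc_freq    -- number of documents that contain the term
    """
    # The collection-dependent part: rarer terms get a larger weight.
    idf = math.log(1 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    # The document-dependent part: term frequency, damped by document length.
    length_norm = 1 - b + b * doc_len / avg_doc_len
    return idf * tf * (k1 + 1) / (tf + k1 * length_norm)


# Same document, same term frequency, but the term is rarer in one collection
# than in the other (the document frequencies here are hypothetical).
common = bm25_term_score(tf=1, doc_len=2, avg_doc_len=2, n_docs=3000, doc_freq=1500)
rare = bm25_term_score(tf=1, doc_len=2, avg_doc_len=2, n_docs=3000, doc_freq=15)
print(common, rare)
```

Only `n_docs`, `doc_freq` and `avg_doc_len` depend on the collection, which is why the same document can score differently in the unit test and in production.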

The important part of this formula is the IDF part. IDF is short for "inverse document frequency", and it is calculated relative to your document collection. So if the collection changes, the value calculated for each term may change too: it can increase or decrease.

The intuition is that the more documents a term occurs in, the less value it brings to a document that contains it. For example, the term "is" is not valuable, because it exists in almost every document, so we can't use it as a discriminator to identify the relevant documents. As another example, the term "java" is more valuable than the term "is", because it occurs in far fewer documents and we know it is not in every document. So it can be used as a discriminator with a higher score. The simplest form of IDF is calculated as follows:

IDF(term) = log(N / n)

N is the total number of documents, and n is the number of documents that contain the term at least once (think of the term as "java"). You can see that the more documents a term occurs in (as n grows), the lower the score it gains.
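As a rough illustration of how this plays out between a tiny test index and a larger production index, the sketch below evaluates the simple IDF for two of the query's terms in both collections. The unit test frequencies assume the index holds only the three documents listed in the question and that "Screws" is analyzed to "screw". The production frequencies are invented for illustration, since the real ones are not given.

```python
import math


def simple_idf(n_docs, doc_freq):
    """Simplest IDF: log(N / n)."""
    return math.log(n_docs / doc_freq)


# Document frequencies per term. The production numbers are hypothetical.
collections = {
    "unit test": {
        "n_docs": 3,
        "doc_freq": {"name:wood": 3, "ancestors:screw": 2},
    },
    "production": {
        "n_docs": 3000,
        "doc_freq": {"name:wood": 40, "ancestors:screw": 900},
    },
}

for label, collection in collections.items():
    for term, doc_freq in collection["doc_freq"].items():
        idf = simple_idf(collection["n_docs"], doc_freq)
        print(f"{label:10s} {term:16s} idf = {idf:.3f}")
```

With these assumed numbers, name:wood carries no weight in the unit test because every document contains it, yet it outweighs ancestors:screw in production. A shift like this in the relative term weights is enough to reorder the same three documents.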