
I'm not sure I understand how the vector space model is used in Lucene scoring.

I read here (https://www.elastic.co/guide/en/elasticsearch/guide/current/practical-scoring-function.html) that Lucene scores a document as the sum of the tf-idf weights of each query term (if we omit the coordination factor, field length, and boosts). I don't understand where the vector space model comes in.

The vector space model could be used to calculate the similarity between the tf-idf vector of a document and the tf-idf vector of the query. This would give us a cosine similarity score between the query and a document. The score would be between 0 and 1, so scores from different queries should be easy to compare.

Why doesn't Lucene compute its score this way?

1 Answer

Lucene uses the 'practical scoring function' described in your link, which is an approximation of cosine similarity, extended to support 'practical' features such as boosts.

If you take the vector space cosine similarity formula for a query q and a document d, you have:

s(q, d) = (q · d) / (||q|| * ||d||)

Considering that q and d are vectors like [tf(t1) * idf(t1), ...], and that in the q vector tf(t) is either 1 or 0, the formula becomes:

s(q, d) = ∑( tf(t in d) * idf(t)² )(t in q) / (||q|| * ||d||)
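As a quick sanity check of this step, here is a minimal Python sketch that computes the cosine similarity directly from the two tf-idf vectors and again from the simplified sum. The idf values, term frequencies, and variable names are made up for illustration; none of this is Lucene's API.

```python
import math

# Hypothetical idf values for a three-term vocabulary (made-up numbers).
idf = {"quick": 1.5, "brown": 2.0, "fox": 3.0}
# Hypothetical term frequencies in document d.
doc_tf = {"quick": 2, "brown": 0, "fox": 1}
# Query terms: tf(t in q) is 1 for these and 0 for every other term.
query_terms = ["quick", "fox"]

vocab = sorted(idf)
q_vec = [(1 if t in query_terms else 0) * idf[t] for t in vocab]
d_vec = [doc_tf[t] * idf[t] for t in vocab]


def euclidean_norm(v):
    """Euclidean length ||v|| of a vector."""
    return math.sqrt(sum(x * x for x in v))


# Direct form: (q . d) / (||q|| * ||d||)
dot = sum(a * b for a, b in zip(q_vec, d_vec))
direct = dot / (euclidean_norm(q_vec) * euclidean_norm(d_vec))

# Simplified form: sum over query terms of tf(t in d) * idf(t)^2,
# divided by the same two norms.
total = sum(doc_tf[t] * idf[t] ** 2 for t in query_terms)
simplified = total / (euclidean_norm(q_vec) * euclidean_norm(d_vec))

# With tf(t in q) restricted to 0 or 1, ||q||^2 is just the sum of
# idf(t)^2 over the query terms.
q_norm_sq = sum(idf[t] ** 2 for t in query_terms)

print(direct, simplified)
```

Both forms agree because every term outside the query contributes zero to the dot product, and every term inside it contributes idf(t) from the query side times tf(t in d) * idf(t) from the document side.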

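You can further replace ||q|| with 1 / queryNorm(q), given the definition queryNorm = 1 / √sumOfSquaredWeights (ignoring boosts, sumOfSquaredWeights is the sum of idf(t)² over the query terms, which is exactly ||q||²):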

s(q, d) = queryNorm(q) * ∑( tf(t in d) * idf(t)² )(t in q) / ||d||

which is close to the formula they give in the docs:

score(q, d) = queryNorm(q) * coord(q,d) * 
              ∑ ( tf(t in d) * idf(t)² * t.getBoost() * norm(t,d)) (t in q)  

However, ||d||, the norm of the document vector, has no direct equivalent among the terms of their formula.
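For comparison, the docs' formula quoted above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the component definitions (tf as the square root of the term frequency, idf as 1 + log(numDocs / (docFreq + 1)), norm as 1 / √numTerms, coord as the fraction of query terms that match) are the classic TF/IDF ones described in the linked guide, and the function and parameter names are hypothetical, not Lucene's API. Note that norm here is a field-length norm, not the Euclidean ||d|| from the cosine formula.

```python
import math


def tf(freq):
    # Assumed classic definition: square root of the term frequency.
    return math.sqrt(freq)


def idf(num_docs, doc_freq):
    # Assumed classic definition: 1 + log(numDocs / (docFreq + 1)).
    return 1.0 + math.log(num_docs / (doc_freq + 1))


def field_norm(num_terms):
    # Assumed field-length norm: 1 / sqrt(number of terms in the field).
    return 1.0 / math.sqrt(num_terms)


def practical_score(query_terms, doc_terms, doc_freqs, num_docs, boosts=None):
    """Sketch of queryNorm * coord * sum(tf * idf^2 * boost * norm).

    query_terms: list of distinct query terms (must be non-empty)
    doc_terms:   list of terms in the document field
    doc_freqs:   hypothetical map from term to number of docs containing it
    """
    boosts = boosts or {}
    idfs = {t: idf(num_docs, doc_freqs.get(t, 0)) for t in query_terms}
    # sumOfSquaredWeights: squared (idf * boost) of each query term.
    sum_sq = sum((idfs[t] * boosts.get(t, 1.0)) ** 2 for t in query_terms)
    query_norm = 1.0 / math.sqrt(sum_sq)

    matched = [t for t in query_terms if t in doc_terms]
    coord = len(matched) / len(query_terms)
    norm = field_norm(len(doc_terms))

    total = sum(
        tf(doc_terms.count(t)) * idfs[t] ** 2 * boosts.get(t, 1.0) * norm
        for t in matched
    )
    return query_norm * coord * total


doc_freqs = {"quick": 3, "fox": 1}
short_doc = "the quick brown fox".split()
long_doc = "the quick brown fox jumped over the lazy sleeping dog".split()
print(practical_score(["quick", "fox"], short_doc, doc_freqs, num_docs=10))
print(practical_score(["quick", "fox"], long_doc, doc_freqs, num_docs=10))
```

In this sketch the shorter document scores higher for the same matches, because the field-length norm takes over the role that dividing by ||d|| plays in the cosine formula, and the result is not guaranteed to stay between 0 and 1.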