Lucene Scoring for Overlap ranking

Question

I'm new to working with Lucene and trying to understand how I can use Lucene for a simpler scoring function.

I have objects in my dataset with 5-10 terms attached to each of them. Lucene uses TF-IDF similarity by default to rank the objects.

TF-IDF does not make sense here because my data does not have varying term frequencies. How can I change the default scoring function so that documents are ranked by the number of keywords they share with the query?

Doc1 = {system engineering artificial intelligence}

Doc2 = {architecture logic programming}

Doc3 = {system architecture engineering}

For the query Query = {system architecture}, I want a ranking where Doc3 is ranked higher than Doc1 and Doc2.

A simple query with one or two terms in it like system architecture above — kami
Could you be more precise? Is it a phrase query? A term query with boolean clauses? — Mysterion

Mysterion · Accepted Answer · 2017-09-04T07:21:08

I would suggest something like this:

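import org.apache.lucene.index.Term;
import org.apache.lucene.search.*;
import org.apache.lucene.search.BooleanClause.Occur;

// One optional (SHOULD) clause per query term, matched against the "text" field
Query query = new BooleanQuery.Builder()
            .add(new TermQuery(new Term("text", "system")), Occur.SHOULD)
            .add(new TermQuery(new Term("text", "architecture")), Occur.SHOULD)
            .build();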

In this case Doc3 will be ranked higher than Doc1 and Doc2, because it matches both clauses. Since both clauses are SHOULD, documents that match only one of the terms are still returned and ranked as well.
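To make the intended ordering concrete, here is a minimal plain-Java sketch of overlap ranking on the three documents from the question. It does not use Lucene at all. The class OverlapRanking and its methods are hypothetical names made up for illustration, and it assumes Doc3's first term is "system". The score of a document is simply the number of query terms it contains, which is the ordering the SHOULD clauses above are meant to approximate.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of Lucene: ranks documents by shared query terms.
class OverlapRanking {

    // Number of distinct query terms that also appear in the document.
    static int overlap(Set<String> docTerms, Set<String> queryTerms) {
        int count = 0;
        for (String term : queryTerms) {
            if (docTerms.contains(term)) {
                count++;
            }
        }
        return count;
    }

    // Names of documents matching at least one query term, highest overlap first.
    // Ties keep the input map's iteration order because List.sort is stable.
    static List<String> rank(Map<String, Set<String>> docs, Set<String> queryTerms) {
        List<String> names = new ArrayList<>();
        for (Map.Entry<String, Set<String>> e : docs.entrySet()) {
            if (overlap(e.getValue(), queryTerms) > 0) {
                names.add(e.getKey());
            }
        }
        names.sort((a, b) -> Integer.compare(
                overlap(docs.get(b), queryTerms), overlap(docs.get(a), queryTerms)));
        return names;
    }

    static Set<String> terms(String... words) {
        return new HashSet<>(Arrays.asList(words));
    }

    // The three documents from the question (Doc3 assumed to start with "system").
    static Map<String, Set<String>> sampleDocs() {
        Map<String, Set<String>> docs = new LinkedHashMap<>();
        docs.put("Doc1", terms("system", "engineering", "artificial", "intelligence"));
        docs.put("Doc2", terms("architecture", "logic", "programming"));
        docs.put("Doc3", terms("system", "architecture", "engineering"));
        return docs;
    }

    public static void main(String[] args) {
        // prints [Doc3, Doc1, Doc2]
        System.out.println(rank(sampleDocs(), terms("system", "architecture")));
    }
}
```

If you want Lucene's scores themselves to equal this count, one option (an untested suggestion on my side) is to wrap each TermQuery in a ConstantScoreQuery, so that every matching SHOULD clause contributes the same constant score and the BooleanQuery sums them.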
