Weighted cosine similarity calculation using Lucene

Question

This question is about calculating cosine similarity between documents using Lucene.

The documents are marked up with Taxonomy and Ontology terms separately. When I calculate the document similarity between documents, I want to give higher weights to those Taxonomy terms and Ontology terms.

When I index the document, I have defined the Document content, Taxonomy and Ontology terms as Fields for each document like this in my program.

Field ontologyTerm = new Field("fiboterms", fiboTermList[curDocNo], Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.YES);
Field taxonomyTerm = new Field("taxoterms", taxoTermList[curDocNo], Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.YES);
Field document = new Field(docNames[curDocNo], strRdElt, Field.TermVector.YES);

I’m using Lucene’s TermFreqVector functions to calculate TF-IDF values, and then I calculate the cosine similarity between two documents from those TF-IDF values.

I can use Lucene’s field.setBoost() function to give higher weights to the fields before indexing. I used the debugger to inspect the frequency values of the Taxonomy terms after setting a boost value, but the boost doesn’t change the term frequencies. Does that mean setBoost() has no effect on the TermFreqVector or the TF-IDF values? Does setBoost() increase the weights only for document searching?

Another option is to programmatically multiply the Taxonomy and Ontology term frequencies by a defined weight factor before calculating the TF-IDF scores. Will this give higher weight to the Taxonomy and Ontology terms in the document similarity calculation?

Are there any other Lucene functions that can be used to give higher weights to certain fields when calculating TF-IDF values using TermFreqVector? Or can I just use setBoost() for this purpose, and if so, how?

You have posted 8 questions so far and accepted none of the answers. You have bad karma and people will be ill-disposed to help you. Go back to your questions and accept the answers. If you are not satisfied with the answer you get, you are supposed to work at it with the answerer until you are satisfied. Abandoning questions after someone has gone to the trouble to help you is not nice. — Marko Topolnik

Xodarap · Accepted Answer · 2012-04-19T19:26:25

The TermFreqVector is just that - the term frequencies. No weights. It says in the docs "Each location in the array contains the number of times this term occurs in the document or the document field."

You can see from Lucene's scoring algorithm that boosts are applied as a multiplicative factor. So if you want to replicate that yourself, then yes, multiplying the Taxonomy and Ontology term frequencies by a weight factor will give those terms a higher weight.
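As a rough illustration of that multiplicative approach, here is a minimal sketch in plain Java. It assumes you have already pulled the terms and raw frequencies out of each field's TermFreqVector (via getTerms() and getTermFrequencies()) and have computed IDF values yourself. The class WeightedCosine, the methods addField and cosine, and the weight values are hypothetical names chosen for this example; they are not part of Lucene.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not a Lucene class: builds a weighted TF-IDF vector
// from per-field term frequencies and compares two such vectors.
final class WeightedCosine {

    // Adds one field's terms to the document vector.
    // terms/freqs mirror what TermFreqVector.getTerms() and
    // getTermFrequencies() return; fieldWeight plays the role of a boost.
    // idf maps a term to its IDF value; a missing term defaults to 1.0.
    // Note: the same term coming from two fields is merged into one entry.
    static void addField(Map<String, Double> docVector, String[] terms, int[] freqs,
                         double fieldWeight, Map<String, Double> idf) {
        for (int i = 0; i < terms.length; i++) {
            double weighted = fieldWeight * freqs[i] * idf.getOrDefault(terms[i], 1.0);
            docVector.merge(terms[i], weighted, Double::sum);
        }
    }

    // Standard cosine similarity between two sparse vectors.
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            Double other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
            normA += e.getValue() * e.getValue();
        }
        for (double v : b.values()) {
            normB += v * v;
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0; // an empty vector is treated as similar to nothing
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}
```

For example, take two documents that each have one distinct content term and share one taxonomy term, all with frequency 1 and IDF 1. With a taxonomy weight of 1.0 their cosine similarity is 0.5; with a taxonomy weight of 3.0 it rises to 0.9, so the shared taxonomy term dominates the comparison as intended.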

I'm not sure what your use case is, but you might want to consider just using Lucene's Scorer class. Then you won't have to deal with making your own.
