This question is about calculating cosine similarity between documents using Lucene.
The documents are marked up with Taxonomy terms and Ontology terms separately. When I calculate the similarity between two documents, I want to give higher weights to the Taxonomy and Ontology terms.
When I index a document, I define the document content, the Taxonomy terms and the Ontology terms as separate Fields, like this:
Field ontologyTerm = new Field("fiboterms", fiboTermList[curDocNo], Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.YES);
Field taxonomyTerm = new Field("taxoterms", taxoTermList[curDocNo], Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.YES);
Field document = new Field(docNames[curDocNo], strRdElt, Field.TermVector.YES);
I’m using Lucene’s TermFreqVector functions to get the term frequencies from the index and calculate TF-IDF values, and then I calculate the cosine similarity between two documents from those TF-IDF values.
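For reference, the calculation looks roughly like the sketch below. It is a simplified version, not my exact code: plain maps stand in for the terms and frequencies I read from TermFreqVector (getTerms() and getTermFrequencies()), and the class name, method names and the particular IDF formula, 1 + ln(N / (df + 1)), are assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper class; assumes Java 8+ for getOrDefault.
public class TfIdfCosine {

    // Turn raw term frequencies into TF-IDF weights.
    // docFreq maps a term to the number of documents containing it;
    // numDocs is the total number of documents in the index.
    public static Map<String, Double> tfIdf(Map<String, Integer> termFreq,
                                            Map<String, Integer> docFreq,
                                            int numDocs) {
        Map<String, Double> weights = new HashMap<>();
        for (Map.Entry<String, Integer> e : termFreq.entrySet()) {
            int df = docFreq.getOrDefault(e.getKey(), 0);
            // Assumed IDF variant: 1 + ln(N / (df + 1))
            double idf = 1.0 + Math.log((double) numDocs / (df + 1));
            weights.put(e.getKey(), e.getValue() * idf);
        }
        return weights;
    }

    // Cosine similarity between two sparse TF-IDF vectors.
    public static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            normA += e.getValue() * e.getValue();
            Double other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        for (double v : b.values()) {
            normB += v * v;
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0; // an empty vector is treated as similar to nothing
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}
```

Nothing in this calculation reads a field boost, so any extra weight for the Taxonomy or Ontology terms would have to enter through the term frequencies themselves.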
I can use Lucene’s Field.setBoost() function to give higher weights to these fields before indexing. I used the debugger to look at the frequency values of the Taxonomy terms after setting a boost value, but the term frequencies do not change. Does that mean setBoost() has no effect on the TermFreqVector or on the TF-IDF values I calculate from it? Does setBoost() only increase the weights used when scoring documents during a search?
Another thing I could do is programmatically multiply the Taxonomy and Ontology term frequencies by a defined weight factor before calculating the TF-IDF scores. Will this give higher weight to the Taxonomy and Ontology terms in the document similarity calculation?
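A minimal sketch of what I mean by multiplying the frequencies is shown below. Again, plain maps stand in for the per-field term frequencies read from TermFreqVector, and the class name, method name and weight values are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper class; assumes Java 8+ for Map.merge.
public class FieldWeighting {

    // Add one field's term frequencies into the combined document vector,
    // scaling every frequency by that field's weight factor.
    public static void addWeighted(Map<String, Double> combined,
                                   Map<String, Integer> fieldTermFreq,
                                   double weight) {
        for (Map.Entry<String, Integer> e : fieldTermFreq.entrySet()) {
            combined.merge(e.getKey(), e.getValue() * weight, Double::sum);
        }
    }

    // Build a combined vector from the content, Taxonomy and Ontology fields.
    // The three weights are illustrative parameters, not Lucene settings.
    public static Map<String, Double> combine(Map<String, Integer> content,
                                              Map<String, Integer> taxoTerms,
                                              Map<String, Integer> ontoTerms,
                                              double contentWeight,
                                              double taxoWeight,
                                              double ontoWeight) {
        Map<String, Double> combined = new HashMap<>();
        addWeighted(combined, content, contentWeight);
        addWeighted(combined, taxoTerms, taxoWeight);
        addWeighted(combined, ontoTerms, ontoWeight);
        return combined;
    }
}
```

For example, if "bond" occurs twice in the content field (weight 1.0) and once in the Taxonomy field (weight 3.0), its combined frequency becomes 2 * 1.0 + 1 * 3.0 = 5.0, and that weighted value is what would go into the TF-IDF calculation.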
Are there any other Lucene functions that can be used to give higher weights to certain fields when calculating TF-IDF values from a TermFreqVector? Or can I use setBoost() for this purpose, and if so, how?