I have a Lucene application with multiple indices, and relevancy scoring suffers because term frequencies differ from one index to the next. My understanding is that the Term Dictionary (.tim file) contains "term statistics" such as the document frequency of each term. One approach I was considering is to modify the .tim files for each index (and its segments) and update the "term statistics" directly. Is it possible to overwrite or modify the .tim and .tip files in this way?
2 Answers
relevancy scoring suffers
From the FAQ:
Score values are meaningful only for purposes of comparison between other documents for the exact same query and the exact same index. When you try to compute a percentage, you are setting up an implicit comparison with scores from other queries.
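To make the "exact same index" point concrete, here is a minimal sketch in plain Java (no Lucene classes involved). It assumes the classic TF-IDF form of inverse document frequency, 1 + ln((docCount + 1) / (docFreq + 1)); the formula your Lucene version and Similarity actually use may differ. The class name and the statistics are made up for illustration.

```java
class PerIndexIdf {
    // Assumed classic TF-IDF style idf: 1 + ln((docCount + 1) / (docFreq + 1)).
    static double idf(long docFreq, long docCount) {
        return 1.0 + Math.log((docCount + 1) / (double) (docFreq + 1));
    }

    public static void main(String[] args) {
        // Hypothetical statistics for the same term in two separate indices.
        double idfA = idf(5, 1_000);    // term is rare in index A
        double idfB = idf(400, 1_000);  // term is common in index B

        // The same term gets a very different weight in each index,
        // so scores from A and B are not on a common scale.
        System.out.printf("index A idf = %.3f%n", idfA);
        System.out.printf("index B idf = %.3f%n", idfB);
    }
}
```

Since the term weight depends on each index's own statistics, an identical document would score differently depending on which index it happens to live in.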
Is it possible? I suppose, but it strikes me as about as good an idea as attempting to change an application by directly modifying the compiled binaries.
If you need very specific behavior from scoring, you should generally implement a Similarity that does what you need. Extending TFIDFSimilarity is often a good starting point. I'm not clear on what the exact problem is, so I can't offer more specific guidance than that, but perhaps it points you in the right general direction.
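As a rough illustration of the kind of logic a custom Similarity could carry instead of patched index files, the sketch below pools the per-index statistics and derives one idf from the totals. This is plain Java, not Lucene API: `PooledIdf`, `pooledIdf`, and the numbers are hypothetical, the idf formula is an assumed classic TF-IDF form, and how you would feed pooled statistics into a Similarity depends on your Lucene version.

```java
class PooledIdf {
    // Assumed classic TF-IDF style idf: 1 + ln((docCount + 1) / (docFreq + 1)).
    static double idf(long docFreq, long docCount) {
        return 1.0 + Math.log((docCount + 1) / (double) (docFreq + 1));
    }

    // Sum document frequency and document count over all indices,
    // then compute a single idf from the pooled totals.
    static double pooledIdf(long[] docFreqs, long[] docCounts) {
        long totalDocFreq = 0;
        long totalDocCount = 0;
        for (long f : docFreqs) totalDocFreq += f;
        for (long c : docCounts) totalDocCount += c;
        return idf(totalDocFreq, totalDocCount);
    }

    public static void main(String[] args) {
        // Hypothetical per-index statistics for one term.
        long[] docFreqs = {5, 400};
        long[] docCounts = {1_000, 1_000};
        System.out.printf("pooled idf = %.3f%n", pooledIdf(docFreqs, docCounts));
    }
}
```

With one shared weight per term, every index scores that term on the same scale, and the on-disk .tim and .tip files stay untouched.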