I'm working on a project where we index relatively small documents/sentences, and we want to search these indexes using large documents as queries. Here is a simple example. I'm indexing this document:
docId : 1
text: "back in black"
And I want to query using the following input:
"Released on 25 July 1980, Back in Black was the first AC/DC album recorded without former lead singer Bon Scott, who died on 19 February at the age of 33, and was dedicated to him."
What is the best approach for this in Lucene? For simple cases, where the text I want to find is exactly the input query, I get better results using my own analyzer plus a PhraseQuery than using QueryParser.parse(QueryParser.escape(...my large input...)), which ends up creating one big BooleanQuery of TermQuerys.
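To make the comparison concrete, here is a minimal sketch of the PhraseQuery approach (the field name "text" and the Lucene 5+ builder API are assumptions; older versions use `new PhraseQuery()` plus `add()`):

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.PhraseQuery;

public class PhraseDemo {
    // Build an exact phrase query for a small indexed sentence,
    // one Term per analyzed token, at consecutive positions.
    public static PhraseQuery phrase(String field, String... tokens) {
        PhraseQuery.Builder builder = new PhraseQuery.Builder();
        int pos = 0;
        for (String token : tokens) {
            builder.add(new Term(field, token), pos++);
        }
        return builder.build();
    }

    public static void main(String[] args) {
        // Matches only documents containing the exact phrase.
        System.out.println(phrase("text", "back", "in", "black"));
    }
}
```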
But I can't use the PhraseQuery approach for a real-world example; I think I have to use a word n-gram approach like the ShingleAnalyzerWrapper, but since my input documents can be quite large, the combinatorics become hard to handle...
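For reference, this is the kind of shingle tokenization I mean — a sketch wrapping StandardAnalyzer to emit 2- and 3-word shingles (field name "text" is arbitrary; by default the wrapper also emits the unigrams), which shows why a long input explodes into many tokens:

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.shingle.ShingleAnalyzerWrapper;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import java.util.ArrayList;
import java.util.List;

public class ShingleDemo {
    // Tokenize text into word n-grams (shingles) of size 2..3,
    // plus the original unigrams (the wrapper's default behavior).
    public static List<String> shingles(String text) throws Exception {
        Analyzer analyzer = new ShingleAnalyzerWrapper(new StandardAnalyzer(), 2, 3);
        List<String> out = new ArrayList<>();
        try (TokenStream ts = analyzer.tokenStream("text", text)) {
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
                out.add(term.toString());
            }
            ts.end();
        }
        return out;
    }
}
```

A roughly 35-word query like the one above already produces on the order of 100 shingles, and each would need to be looked up or combined into a query.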
In other words, I'm stuck, and any idea would be greatly appreciated :)
P.S. I didn't mention it, but one of the annoying things about indexing small documents is that, because the norm value (a float) is encoded on only one byte, all 3-4 word sentences end up with the same norm value. As a result, searching for a sentence like "A B C" makes the results "A B C" and "A B C D" show up with the same score.
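One way I could imagine working around this, assuming a recent Lucene (7+) where the one-byte norm stores the term count itself and decodes it exactly for short fields, and where ClassicSimilarity exposes lengthNorm(int): plug in a custom Similarity with a steeper length penalty so an extra word actually changes the score. This is a sketch, not something I've validated:

```java
import org.apache.lucene.search.similarities.ClassicSimilarity;

// Sketch: penalize field length more sharply than the default
// 1/sqrt(numTerms), so "A B C D" scores below "A B C" for the
// query "A B C". Must be set on both the IndexWriterConfig and
// the IndexSearcher to take effect consistently.
public class SteepLengthNorm extends ClassicSimilarity {
    @Override
    public float lengthNorm(int numTerms) {
        return 1.0f / numTerms; // steeper than the default 1/sqrt
    }
}
```

Usage would be something like `writerConfig.setSimilarity(new SteepLengthNorm())` at index time and `searcher.setSimilarity(new SteepLengthNorm())` at search time; on older Lucene versions the norm is quantized before storage, so a query-time override like this wouldn't be enough.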
Thanks !