
Is it possible to search for document similarity based on term-vector positions in Lucene?
For example, suppose there are three documents with the following content:

1: Hi how are you
2: Hi how you are
3: Hi how are you

Now, if I search Lucene with the content of doc 1, it should return doc 3 with a higher score and doc 2 with a lower score, because in doc 2 the words "you" and "are" are at different positions.

In short, Lucene should rank documents by how exactly their term positions match.


1 Answer


I think what you need is a PhraseQuery. It is a Lucene Query type that takes the exact positions of your tokens into account and lets you define a slop, that is, a tolerance for how far tokens may move out of position.

In other words, with a slop of 0 only documents containing the terms in exactly that order match; with a larger slop, the more the token positions differ from the query, the lower the document scores.
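To make the position argument concrete, here is a toy sketch in plain Java that does not use Lucene at all. The class `PositionSpread` and its methods are hypothetical names made up for this illustration, and the measure (the spread between the largest and smallest per-term shift) is only a simple stand-in, not Lucene's scoring formula. It also assumes each term occurs at most once per document.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PositionSpread {

    // Lowercase and split on whitespace; a stand-in for a real analyzer.
    static List<String> tokenize(String text) {
        return Arrays.asList(text.toLowerCase().trim().split("\\s+"));
    }

    // For every query term, compute (position in doc - position in query),
    // then return max shift - min shift. 0 means the terms appear in exactly
    // the same relative order. Returns -1 if a query term is missing.
    // Assumption: each term occurs at most once in the document.
    static int spread(String query, String doc) {
        List<String> queryTerms = tokenize(query);
        List<String> docTerms = tokenize(doc);

        Map<String, Integer> docPositions = new HashMap<>();
        for (int i = 0; i < docTerms.size(); i++) {
            docPositions.put(docTerms.get(i), i);
        }

        int min = Integer.MAX_VALUE;
        int max = Integer.MIN_VALUE;
        for (int i = 0; i < queryTerms.size(); i++) {
            Integer pos = docPositions.get(queryTerms.get(i));
            if (pos == null) {
                return -1; // term not present in the document
            }
            int shift = pos - i;
            min = Math.min(min, shift);
            max = Math.max(max, shift);
        }
        return max - min;
    }

    public static void main(String[] args) {
        String doc1 = "Hi how are you";
        System.out.println(spread(doc1, "Hi how you are")); // doc 2
        System.out.println(spread(doc1, "Hi how are you")); // doc 3
    }
}
```

For the documents in the question, doc 3 gets a spread of 0 against doc 1, while doc 2 gets 2 because "are" moves one position right and "you" one position left. That difference is what a position-aware query can use to rank doc 3 above doc 2.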

You can use it like this:

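import org.apache.lucene.search.Query;
import org.apache.lucene.util.QueryBuilder;
QueryBuilder analyzedBuilder = new QueryBuilder(new MyAnalyzer()); // MyAnalyzer: your own Analyzer
Query query = analyzedBuilder.createPhraseQuery("fieldToSearchOn", textQuery); // declared return type is Query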

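createPhraseQuery also has an overload that takes a third parameter, the slop I mentioned above, if you want to tweak it. In your example, swapping two adjacent words counts as two moves, so doc 2 will only match with a slop of at least 2.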

Regards,