My requirement is to generate match score on fuzzy phrase search. Example
1) Input Data - Hello Sam, how are you doing? Thanks, Smith.
- Indexed Document - Sam Smith (documents are always person/organization names and input data would be free-text data)
In above case, both Sam and Smith found in my input data but contextually both are different persons. If my input data would be "Hello Sam Smith" than I should get relevant hit with higher score (Also I am expecting OK score for "Hello Sam John Smith" and so on).
I am using Lucene here for primary filtering and later will post-process matched documents with input data and define match score (using levenshtein), and it should also work for fuzzy.
Exact approach,
1) Index documents as Tri-Grams
2) Search input free text data with Tri-Gram indexes
3) Gather all matching documents (this will have lot of noisy data yet)
4) Post process every matched document and define position of every matched tri-gram token in input free text data and calculate levenshtein score between possible position token(s) and entire document.
e.g - Hello Sam, how are you doing? Thanks, Smith.
In here my document match will be "Sam Smith", I want to look at each tri gram of index and its position match in input free text data, like
1) token "sam" matched with 2nd position word "Sam" in input data
2) token "smi" matched with 8th position word "Smith" in input data
Later I will write logic to calculate levenshtein score of token 2, 8 with actual matched document (it would be very less score considering proximity between position 2 and 8), but if position of tokens were 2,3 (or 2,4) I would have given good scoring.
Would like to get feedback from experts on this approach, or some better suggestions, Thanks.