Problem Description
I'm trying to implement a custom algorithm to match user provided free-text input, a company name such as "Ford Motor", against a reference data source consisting of 1.4 million company names.
The algorithm executes following steps:
Step 1) Performs an "Exact Match", followed by "Begins Match" and finally "Contains Match" of user provided search input. Results from this step are also sorted in the same order.
Step 2) Performs a token by token match of search input with reference company name.
Every token is matched in following order: Exact, Begins, Contains, Levenshtein Distance (< 0.2) and Refined Soundex.
E.g. If user input is "Foord Motur Holding" and it's being matched against "The Ford Motor Holdings Company" then first token "Foord" will match "Ford" based on Soundex match, second token "Motur" will match "Motor" based on Edit Distance Algo and and last token "Holding" will match "Holdings" via Begins match.
Scoring: Every token match is first scored on a scale that rates the matching technique, with Exact match being the best and Soundex being the worst.
The overall score is calculated, on a scale of 0-100%, by calculating a weighted average of individual token-match scores. Weights are assigned based on index-order of token i.e. the first token has highest weight and last token has lowest.
My Partial Solution
I have implemented a simple schema in solr to store referance company names. A String field (called companyName), a simple text field (called as companyText) copied from string and another text field (called as companySoundex) copied from string and using PhoneticFilterFactory for Refined Soundex based matching.
I have been able to replicate step 1) in a single solr query.
For step 2) I plan to fire 3 parallel queries to solr server. First query performing a simple text search on companyText field, second query performing fuzzy match using ~ operator on companyText field and third query performing soundex match on companySoundex field. I plan to somehow combine the results from these 3 parallel queries to get desired final result.
Questions:
1) Is there a better way to replicate Step 2) of original algorithm?
2) Even if I go with my "three-parallel-queries" approach then how to get the "right" sorting order as I get in the original algorithm ? I guess the main problem is how to compare the solr scores from these 3 entirely different queries to do the final combining of results
Thanks for reading this long question. Any help/pointers would be greatly appreciated.