2
votes

I would like to search a Lucene index with edit distances. For example, say, there is a document with a field FIRST_NAME; I want all documents with first names that are 1 edit distance away from, say, 'john'.

I know that Lucene supports fuzzy searches (FIRST_NAME:john~) and takes a number between 0 and 1 to control the fuzziness. The problem (for me) is this number does not directly translate to an edit distance. And when the values in the documents are short strings (less than 3 characters) the fuzzy search has difficulty finding them. For example if there is a document with FIRST_NAME 'J' and I search for FIRST_NAME:I~0.0 I don't get anything back.

2

2 Answers

4
votes

In Lucene's FuzzyQuery, you cannot specify the extact distance. You can specify the value of "fuzziness" between 0 and 1 where values closer to 0 indicate broad match and values closer to 1 indicate narrow match. The formula for "fuzziness" is as follows. (From Lucene in Action)

From this formula, you can work back to an approximate fuzziness for given value of distance. So, StackOverflow is to be matched with StackUnderflow, which is at a distance of 3, the fuzziness required will be approximately 0.77.

2
votes

If you only need 1 edit distance away and if the result can contain the exact match, then you can use the single character wildcard in the query language. If the name is

john

then the query that matches it and any term within 1 edit distance would look like

?john OR j?ohn OR jo?hn OR joh?n OR john? OR ohn OR jhn OR joh OR ?ohn OR j?hn OR jo?n OR joh?

For more more complex cases, you might need to get a list of terms in the index (using IndexReader.term()), keep the ones that are 1 edit distance away, and search for any of those terms.