11
votes

I am looking for a way to write a Lucene fuzzy query that finds all documents relevant to an approximate phrase. For example, if I search for "mosa employee appreciata", a document containing "most employees appreciate" should be returned as a result.

I tried to use:

FuzzyQuery query = new FuzzyQuery(new Term("contents", "mosa employee appreicata"));

Unfortunately, in practice it doesn't work. FuzzyQuery uses edit distance, so in theory "mosa employee appreciata" should match "most employees appreciate", provided an appropriate distance is allowed. It seems a bit odd.

Any clues? Thank you.

4
Additional details are needed: How did you index the contents field? What Analyzer are you using? Did you try a closer search (start with the exact phrase, then change a single character, ...)? How much latitude do you give in the query parameters? What exactly did you get? - Yuval F

4 Answers

16
votes

There are two likely problems here. First: I'm guessing the "contents" field is being analyzed such that "most employees appreciate" is not a single term, but rather three terms. Defining the whole phrase as a single Term is not appropriate in this case.

However, even if the content were indexed as a single term, a second likely problem is that the edit distance is too large to get a match. The Damerau-Levenshtein distance between "mosa employee appreicata" and "most employees appreciate" is 4 (approximately the distance, incidentally, between my average first shot at spelling "Damerau-Levenshtein" and the correct spelling). As of Lucene 4.0, FuzzyQuery handles edit distances of no more than 2, both for performance reasons and on the assumption that larger distances are usually not particularly relevant.
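To check the figure of 4, here is a minimal plain-Java sketch (no Lucene involved) of the optimal string alignment variant of Damerau-Levenshtein, which counts an adjacent transposition as one edit. The class and method names are made up for illustration, and this is not Lucene's own implementation.

```java
public class EditDistanceDemo {

    // Optimal string alignment distance: insertions, deletions,
    // substitutions and adjacent transpositions each cost 1.
    static int distance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                                   d[i - 1][j - 1] + cost);
                // Adjacent transposition, e.g. "ic" <-> "ci".
                if (i > 1 && j > 1
                        && a.charAt(i - 1) == b.charAt(j - 2)
                        && a.charAt(i - 2) == b.charAt(j - 1)) {
                    d[i][j] = Math.min(d[i][j], d[i - 2][j - 2] + 1);
                }
            }
        }
        return d[a.length()][b.length()];
    }

    public static void main(String[] args) {
        // Whole phrase as one string: prints 4
        System.out.println(distance("mosa employee appreicata", "most employees appreciate"));
        // Word by word: prints 1, 1 and 2
        System.out.println(distance("mosa", "most"));
        System.out.println(distance("employee", "employees"));
        System.out.println(distance("appreicata", "appreciate"));
    }
}
```

The whole phrase is 4 edits away, but each word taken on its own is within 2.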

If you need to perform a phrase query with fuzzy terms, you should look into either MultiPhraseQuery, or combine a set of SpanQueries (especially SpanMultiTermQueryWrapper and SpanNearQuery) to meet your needs.

SpanQuery[] clauses = new SpanQuery[3];
clauses[0] = new SpanMultiTermQueryWrapper<>(new FuzzyQuery(new Term("contents", "mosa")));
clauses[1] = new SpanMultiTermQueryWrapper<>(new FuzzyQuery(new Term("contents", "employee")));
clauses[2] = new SpanMultiTermQueryWrapper<>(new FuzzyQuery(new Term("contents", "appreicata")));
// slop 0, in order: the three fuzzy terms must be adjacent and in this sequence
SpanNearQuery query = new SpanNearQuery(clauses, 0, true);

And since none of the individual terms has an edit distance greater than 2 from its counterpart, this should be more effective.
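To illustrate the idea behind the span approach without a Lucene index, here is a rough plain-Java sketch: each query word must be within a maximum edit distance of the document word at the matching position, and the words must be consecutive and in order. It assumes simple lowercasing and whitespace tokenization, which is not what a real Analyzer does, and all names are hypothetical. Lucene's span queries work against the index and are implemented quite differently.

```java
import java.util.Locale;

public class FuzzyPhraseSketch {

    // Optimal string alignment distance (Levenshtein plus adjacent transpositions).
    static int distance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                                   d[i - 1][j - 1] + cost);
                if (i > 1 && j > 1
                        && a.charAt(i - 1) == b.charAt(j - 2)
                        && a.charAt(i - 2) == b.charAt(j - 1)) {
                    d[i][j] = Math.min(d[i][j], d[i - 2][j - 2] + 1);
                }
            }
        }
        return d[a.length()][b.length()];
    }

    // True if the query words match consecutive document words, in order,
    // each within maxEdits (the analogue of slop 0, inOrder = true).
    static boolean matches(String document, String query, int maxEdits) {
        String[] doc = document.toLowerCase(Locale.ROOT).trim().split("\\s+");
        String[] q = query.toLowerCase(Locale.ROOT).trim().split("\\s+");
        for (int start = 0; start + q.length <= doc.length; start++) {
            boolean ok = true;
            for (int i = 0; i < q.length && ok; i++) {
                ok = distance(q[i], doc[start + i]) <= maxEdits;
            }
            if (ok) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        String doc = "Surveys show most employees appreciate feedback";
        // prints true: per-word distances are 1, 1 and 2
        System.out.println(matches(doc, "mosa employee appreicata", 2));
        // prints false: "appreicata" is 2 edits from "appreciate"
        System.out.println(matches(doc, "mosa employee appreicata", 1));
    }
}
```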

1
votes

ComplexPhraseQueryParser handles fuzzy searching on the words of a phrase, i.e., you specify which words should be fuzzy searched and which should not. It works as follows:

// the phrase must be in double quotes for the parser to treat it as a phrase
Query query = new ComplexPhraseQueryParser("content", analyzer)
                    .parse("\"some test~ query~ blah blah\"");

Seems to work nicely. I'm not sure about performance, but it seems to work well on small data sets.

0
votes

The answer from femtoRgon is great! Thank you.

There is another way to solve this problem.

//declare a MultiPhraseQuery
MultiPhraseQuery childrenInOrder = new MultiPhraseQuery();

//use FuzzyTermEnum to enumerate index terms close to each word of your query string
FuzzyTermEnum fuzzyEnumeratedTerms1 = new FuzzyTermEnum(reader, new Term(searchField,"mosa"));
FuzzyTermEnum fuzzyEnumeratedTerms2 = new FuzzyTermEnum(reader, new Term(searchField,"employee"));
FuzzyTermEnum fuzzyEnumeratedTerms3 = new FuzzyTermEnum(reader, new Term(searchField,"appreicata"));

//this pulls the current matching term out of each enumeration (null if the index has no match)
Term termHolder1 = fuzzyEnumeratedTerms1.term();
Term termHolder2 = fuzzyEnumeratedTerms2.term();
Term termHolder3 = fuzzyEnumeratedTerms3.term();

//put the matched terms into the MultiPhraseQuery, falling back to the original word
if (termHolder1==null){
    childrenInOrder.add(new Term(searchField,"mosa"));
}else{
    childrenInOrder.add(fuzzyEnumeratedTerms1.term());
}

if (termHolder2==null){
    childrenInOrder.add(new Term(searchField,"employee"));
}else{
    childrenInOrder.add(fuzzyEnumeratedTerms2.term());
}

if (termHolder3==null){
    childrenInOrder.add(new Term(searchField,"appreicata"));
}else{
    childrenInOrder.add(fuzzyEnumeratedTerms3.term());
}


//close the enumerations - it is important to close them
fuzzyEnumeratedTerms1.close();
fuzzyEnumeratedTerms2.close();
fuzzyEnumeratedTerms3.close();

0
votes

I had some (very limited) mileage with the following:

String[] searchTerms = searchString.split(" ");
FuzzyLikeThisQuery fltq = new FuzzyLikeThisQuery(searchTerms.length, new StandardAnalyzer());
Arrays.stream(searchTerms)
    .forEach(term -> fltq.addTerms(term, FIELD, SIMILARITY_IN_EDITS, PREFIX_LENGTH));

This query matches strings that are far too distant from the indexed content. The only strings that don't match are those where every term is more than 2 edits away from the terms in the indexed content.

Please use at your own peril.