In Lucene, I can use fuzzy search to get 'similar' results.

For example, the following query:

text:awesome~0.8

will find documents containing terms that are 80% similar, like 'awesom'.

My question is: can I use fuzzy search on an entire text (multiple words)?

For example, I want to find texts that are 80% similar to the following text:

this is my text with multiple words

Putting a fuzzy clause on each word would not give me the desired results:

text:(+this~0.8 +is~0.8 +my~0.8 +text~0.8 +with~0.8 +multiple~0.8 +words~0.8)

This would return only documents that contain every word in the query (or a word that is 80% similar to each one).

I expect the query to return results where the entire string is 80% similar (even if a whole word is missing), for example:

this is text with multiple words

Something like this:

text:(+this +is +my +text +with +multiple +words)~0.8

Obviously, the above query gives a syntax error, but I need results based on the similarity of the entire text/phrase.

I am happy to use the Java API classes for this, as I need to do it in a Java program.

1 Answer

I am not sure that a floating-point similarity is still allowed for fuzzy queries in Lucene. From Lucene 4.0 onward, FuzzyQuery supports a maximum edit distance of 2.

Let's assume you want an edit distance of 2. You can use KeywordAnalyzer when indexing your field, so that the field values are not tokenized. When searching, you can then use a FuzzyQuery whose term contains the full text.

Limitations of this solution:

  • The maximum edit distance is 2.
  • This assumes that whatever you are looking up is the full value of that field. For example, if your indexed value is "this is my text", you cannot get the document by searching for "this is ny" (a query containing a mistake). You can get the document if you query for "this is ny text".
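
To make the second limitation concrete, here is a minimal sketch in plain Java (no Lucene dependency) that computes the edit distance between an indexed value and a query string, and checks it against the cap of 2. The class and method names (`WholeFieldFuzzyDemo`, `editDistance`, `wouldMatch`) are made up for illustration and are not part of Lucene. It also uses plain Levenshtein distance, which is an assumption on my part: Lucene's own matching may count a transposition of two adjacent characters as a single edit, so treat this as an approximation of the idea rather than Lucene's exact behaviour.

```java
import java.util.List;

// Hypothetical helper for illustration only; not a Lucene class.
final class WholeFieldFuzzyDemo {

    // Assumed cap, taken from the limitation described above (at most 2 edits).
    static final int MAX_EDITS = 2;

    /** Plain Levenshtein distance (insert, delete, substitute) using two rows. */
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into the first j chars of b takes j inserts
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning the first i chars of a into "" takes i deletes
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(
                        Math.min(curr[j - 1] + 1, prev[j] + 1),
                        prev[j - 1] + cost);
            }
            int[] tmp = prev; // reuse the arrays for the next row
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** True if the whole query is within MAX_EDITS of the whole indexed value. */
    static boolean wouldMatch(String indexedValue, String queryText) {
        return editDistance(indexedValue, queryText) <= MAX_EDITS;
    }

    public static void main(String[] args) {
        String indexed = "this is my text";
        for (String query : List.of("this is ny text", "this is ny")) {
            System.out.println("\"" + query + "\": distance="
                    + editDistance(indexed, query)
                    + ", match=" + wouldMatch(indexed, query));
        }
    }
}
```

With the values from the example above, "this is ny text" is 1 edit away from "this is my text" and falls within the cap, while "this is ny" is 6 edits away and does not. Note that the text from the question with the word "my" dropped is 3 edits away from the original under this measure, so it would also fall outside a cap of 2.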