3
votes

Hello stackOverflowers

I have a Solr document collection with a field called names_txt; it is a multiValued="true" field.

This field contains the names of all persons associated with a document.

I want to do a fuzzy search and, at the same time, limit the number of terms between the two matching terms.

The query

names_txt:("markus foss"~2)

will return all documents containing the terms markus and foss with at most 2 terms between them.

But when I search in a fuzzy way and also want to specify the maximum number of terms between the matches, I can't get the syntax right.

The query:

names_txt:(markus~0.7 foss~0.7)

This does work, but it returns false positives, since it will match a document with "markus something" in one value and "foss somethingElse" in another.

What I would like to write is:

(markus~0.7 foss~0.7)~2
but this syntax is illegal in Solr.

Does anyone out there have a solution for my problem?

2 Answers

1
votes

Since a single query clause in Solr can apply either a word-distance constraint or a fuzzy-search constraint, but not both, we need two clauses:

names_txt:("markus foss"~2) AND names_txt:(markus~0.7 foss~0.7)

Note that quantifying fuzziness by a float number is deprecated. Internally, Lucene converts the float to an int between 0 and 2 anyway, so we should use this integer (Damerau-Levenshtein) edit distance in our search terms right from the beginning. So my final proposal is:

names_txt:("markus foss"~2) AND names_txt:(markus~1 foss~1)

(For those who are interested: The deprecated, somewhat quirky function that converts the similarity float to an edit distance int can be found at the end of this code file.)
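For illustration, here is a minimal Python sketch of how such a similarity-to-edits conversion could work. It is written from memory of Lucene's deprecated FuzzyQuery.floatToEdits, so treat the exact rule as an assumption rather than a copy of the source; the names float_to_edits and MAX_SUPPORTED_DISTANCE are made up for this example.

```python
# Assumption: Lucene's fuzzy matching supports at most 2 edits.
MAX_SUPPORTED_DISTANCE = 2


def float_to_edits(minimum_similarity: float, term_length: int) -> int:
    """Approximate the legacy similarity float as an integer edit distance."""
    if minimum_similarity >= 1.0:
        # Values >= 1 are already read as an edit count, capped at the maximum.
        return int(min(minimum_similarity, MAX_SUPPORTED_DISTANCE))
    if minimum_similarity == 0.0:
        # 0 means exact match, not an unlimited number of edits.
        return 0
    # Fractional similarity: scale by the term length, truncate, then cap.
    return min(int((1.0 - minimum_similarity) * term_length), MAX_SUPPORTED_DISTANCE)


if __name__ == "__main__":
    for term in ("markus", "foss"):
        print(term, float_to_edits(0.7, len(term)))
```

Under this rule, ~0.7 comes out as one edit for both markus and foss, which is why ~1 is used in the query above.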

0
votes

I think you could do that using SpanQuery. The issue is that the usual query parsers in Solr don't support span queries. Look at this article, which mentions those that do: Surround, Xml-Query-Parser and Qsol. But check the status of each in the current Solr version.
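To make clear what a span-style query would have to compute here, the following is a toy Python model, not Solr or Lucene code. It checks each value of a multi-valued field separately for two fuzzy-matching tokens with at most a given number of tokens between them. The helper names (edit_distance, fuzzy_near, doc_matches), the whitespace tokenizer, and the plain Levenshtein distance without transpositions are all simplifying assumptions of this sketch.

```python
def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def fuzzy_near(value: str, first: str, second: str,
               max_edits: int = 1, slop: int = 2) -> bool:
    """True if both terms fuzzily match tokens of one value, in any order,
    with at most `slop` tokens between them."""
    tokens = value.lower().split()
    hits_a = [i for i, t in enumerate(tokens) if edit_distance(t, first) <= max_edits]
    hits_b = [i for i, t in enumerate(tokens) if edit_distance(t, second) <= max_edits]
    return any(i != j and abs(i - j) - 1 <= slop for i in hits_a for j in hits_b)


def doc_matches(values, first, second, **kwargs) -> bool:
    """A document matches if any single field value matches."""
    return any(fuzzy_near(v, first, second, **kwargs) for v in values)


if __name__ == "__main__":
    print(doc_matches(["Marcus Foss"], "markus", "foss"))
    print(doc_matches(["markus something", "foss somethingElse"], "markus", "foss"))
```

The first document matches because "marcus" is one edit away from "markus" and sits next to "foss". The second does not, because the two terms occur in different values, which is exactly the false positive from the question.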