5
votes

I've got an unusual situation. Normally when you search a text index you are searching for a small number of keywords against documents with a larger number of terms.

For example you might search for "quick brown" and expect to match "the quick brown fox jumps over the lazy dog".

My situation is the reverse: I have lots of short phrases in my document store and I wish to match them against a longer query phrase.

For example if I have a query:

  • "the quick brown fox jumps over the lazy dog"

and the documents

  • "quick brown"
  • "fox over"
  • "lazy dog"

I'd like to find the documents that have a phrase that occurs in the query. In this case "quick brown" and "lazy dog" (but not "fox over" because although the tokens match it's not a phrase in the search string).

Is this sort of query possible with Solr/Lucene?

3 Answers

4
votes

It sounds like you want ShingleFilter in your analysis chain, so that you index word bigrams. Add ShingleFilterFactory at both index and query time.

At index time your documents are then indexed as such:

  • "quick brown" -> quick_brown
  • "fox over" -> fox_over
  • "lazy dog" -> lazy_dog

At query time your query becomes:

  • "the quick brown fox jumps over the lazy dog" -> "the_quick quick_brown brown_fox fox_jumps jumps_over over_the the_lazy lazy_dog"
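To make the shingling step concrete, here is a minimal pure-Python sketch of what the bigram output looks like. It does not call Solr or Lucene; the `shingles` helper is a made-up name, and it assumes bigram-only output joined with an underscore, as in the examples above.

```python
def shingles(text, size=2, sep="_"):
    """Return the word n-grams (shingles) of `text`, joined by `sep`.

    Hypothetical helper: it only imitates bigram shingling on
    whitespace-separated, lowercased tokens.
    """
    tokens = text.lower().split()
    return [sep.join(tokens[i:i + size]) for i in range(len(tokens) - size + 1)]


# Index side: each short document becomes a single bigram.
doc_shingles = {doc: shingles(doc) for doc in ["quick brown", "fox over", "lazy dog"]}

# Query side: the long phrase becomes a sequence of overlapping bigrams.
query_shingles = shingles("the quick brown fox jumps over the lazy dog")
```

Note that `fox_over` never appears in `query_shingles`, because "fox" and "over" are not adjacent in the query.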

This is still no good, because by default it will form a phrase query. So, in your query analyzer only, add PositionFilterFactory after the ShingleFilterFactory. This "flattens" the positions in the query so that the query parser treats the output as synonyms, which yields a BooleanQuery with these sub-queries (all SHOULD clauses, so it's basically an OR query):

BooleanQuery:

  • the_quick OR
  • quick_brown OR
  • brown_fox OR
  • ...

This should be the most performant way, since it's then really just a BooleanQuery of TermQueries.
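The resulting OR-of-terms lookup can be imitated in a few lines of plain Python. This is only a sketch of the logic, not Lucene code: `shingles`, `build_index` and `search` are hypothetical helpers, and the "index" is just a dict from shingle to the documents that contain it.

```python
from collections import defaultdict


def shingles(text, size=2, sep="_"):
    """Word bigrams of `text`, joined by `sep` (hypothetical helper)."""
    tokens = text.lower().split()
    return [sep.join(tokens[i:i + size]) for i in range(len(tokens) - size + 1)]


def build_index(docs):
    """Map each shingle to the set of documents that contain it."""
    index = defaultdict(set)
    for doc in docs:
        for shingle in shingles(doc):
            index[shingle].add(doc)
    return index


def search(index, query):
    """OR together the postings of every query shingle."""
    hits = set()
    for shingle in shingles(query):
        hits |= index.get(shingle, set())
    return hits


index = build_index(["quick brown", "fox over", "lazy dog"])
hits = search(index, "the quick brown fox jumps over the lazy dog")
```

Here `hits` contains "quick brown" and "lazy dog" but not "fox over". One thing to keep in mind: in this sketch a stored phrase longer than two words would match as soon as any one of its bigrams occurs in the query.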

2
votes

Sounds like you want the DisMax "minimum match" parameter. I wrote a blog article on the concept a little while ago: http://blog.websolr.com/post/1299174416. There's also the Solr wiki page on minimum match.

The "minimum match" concept is applied to all the "optional" terms in your query, that is, terms that aren't explicitly marked with +/- as "+mandatory" or "-prohibited". By default, the minimum match is 100%, meaning that 100% of the optional terms must be present. In other words, all of your terms are considered mandatory.

This is why your longer query isn't currently matching documents containing shorter fragments of that phrase. The other keywords in the longer search phrase are treated as mandatory.

If you drop the minimum match down to 1, then a document only needs to contain one of your optional terms. In some ways this is the opposite of the default of 100%. It's like your query of quick brown fox… is turned into quick OR brown OR fox OR … and so on.

If you set your minimum match to 2, then your search phrase will get broken up into groups of two terms. A search for quick brown fox turns into (quick brown) OR (brown fox) OR (quick fox) … and so on. (Excuse my pseudo-query there, I trust you see the point.)

The minimum match parameter also supports percentages -- say, 20% -- and some even more complex expressions. So there's a fair amount of tweakability.
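To see what the different settings do, here is a rough Python imitation of the minimum-match rule. It is not the DisMax implementation: `required_matches` and `mm_matches` are invented names, only plain integers and simple "NN%" strings are handled, and rounding percentages down with a floor of one clause is an assumption of this sketch.

```python
def required_matches(mm, n_clauses):
    """Turn an mm value (int or 'NN%' string) into a required clause count."""
    if isinstance(mm, str) and mm.endswith("%"):
        needed = n_clauses * int(mm[:-1]) // 100  # assumption: round down
    else:
        needed = int(mm)
    # Assumption: at least one clause must match, and never more than exist.
    return max(1, min(needed, n_clauses))


def mm_matches(query, docs, mm):
    """Return the docs containing at least the required number of query terms."""
    clauses = query.lower().split()
    needed = required_matches(mm, len(clauses))
    hits = []
    for doc in docs:
        doc_terms = set(doc.lower().split())
        matched = sum(1 for term in clauses if term in doc_terms)
        if matched >= needed:
            hits.append(doc)
    return hits


DOCS = ["quick brown", "fox over", "lazy dog"]
QUERY = "the quick brown fox jumps over the lazy dog"

all_required = mm_matches(QUERY, DOCS, "100%")  # every query term required
any_one = mm_matches(QUERY, DOCS, 1)            # a single term is enough
any_two = mm_matches(QUERY, DOCS, 2)            # two terms, in any position
```

With these inputs, "100%" matches nothing, while both 1 and 2 match all three short documents, because the rule only counts terms and ignores whether they are adjacent in the query.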

1
votes

Setting only the mm parameter will not satisfy your needs, since the query

"the quick brown fox jumps over the lazy dog"

will match all three documents:

  • "quick brown"
  • "fox over"
  • "lazy dog"

and as you said:

I'd like to find the documents that have a phrase that occurs in the query. In this case "quick brown" and "lazy dog" (but not "fox over" because although the tokens match it's not a phrase in the search string).
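The difference between the two notions of "match" can be spelled out with a small Python sketch. Both helpers are hypothetical and work on lowercased, whitespace-separated tokens only: one checks that every term of the stored phrase occurs somewhere in the query, the other checks that the terms occur contiguously and in order.

```python
def shares_all_terms(query, phrase):
    """True if every term of `phrase` appears somewhere in `query`."""
    return set(phrase.lower().split()) <= set(query.lower().split())


def contains_phrase(query, phrase):
    """True if the tokens of `phrase` appear contiguously, in order, in `query`."""
    q = query.lower().split()
    p = phrase.lower().split()
    return any(q[i:i + len(p)] == p for i in range(len(q) - len(p) + 1))


QUERY = "the quick brown fox jumps over the lazy dog"
DOCS = ["quick brown", "fox over", "lazy dog"]

term_hits = [d for d in DOCS if shares_all_terms(QUERY, d)]
phrase_hits = [d for d in DOCS if contains_phrase(QUERY, d)]
```

Term matching returns all three documents, while the contiguous check drops "fox over", which is the behaviour asked for in the question.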