How to match against subsets of a search string in SOLR/lucene

Question

I've got an unusual situation. Normally when you search a text index you are searching for a small number of keywords against documents with a larger number of terms.

For example you might search for "quick brown" and expect to match "the quick brown fox jumps over the lazy dog".

I have the situation where I have lots of small phrases in my document store and I wish to match them against a larger query phrase.

For example if I have a query:

"the quick brown fox jumps over the lazy dog"

and the documents

"quick brown"
"fox over"
"lazy dog"

I'd like to find the documents that have a phrase that occurs in the query. In this case "quick brown" and "lazy dog" (but not "fox over" because although the tokens match it's not a phrase in the search string).

Is this sort of query possible with SOLR/lucene?

Robert Muir · Accepted Answer · 2011-02-05T17:06:10

It sounds like you want ShingleFilter in your analysis chain, so that you index word bigrams: add ShingleFilterFactory at both index and query time.

At index time your documents are then indexed as follows:

"quick brown" -> quick_brown
"fox over" -> fox_over
"lazy dog" -> lazy_dog

At query time your query becomes:

"the quick brown fox jumps over the lazy dog" -> "the_quick quick_brown brown_fox fox_jumps jumps_over over_the the_lazy lazy_dog"

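To make the two mappings above concrete, here is a minimal Python sketch of word-bigram shingling. It only imitates the idea and is not the Lucene implementation: the `shingles` helper is a hypothetical name, and the underscore separator simply follows the notation used in this answer (the real filter's separator and options, such as whether it also emits single words, depend on its configuration).

```python
def shingles(text, size=2, sep="_"):
    """Return word n-grams ("shingles") of `size` words joined by `sep`.

    Hypothetical helper for illustration; it lowercases and splits on
    whitespace, which is an assumption about the rest of the analyzer.
    """
    words = text.lower().split()
    return [sep.join(words[i:i + size]) for i in range(len(words) - size + 1)]


# Index-time view of a short document
print(shingles("quick brown"))

# Query-time view of the long query
print(shingles("the quick brown fox jumps over the lazy dog"))
```
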
This is still not enough, because by default the query parser will form a phrase query from these tokens. So, in your query analyzer only, add PositionFilterFactory after the ShingleFilterFactory. This "flattens" the positions in the query so that the query parser treats the output as synonyms, which yields a BooleanQuery with these subqueries (all SHOULD clauses, so it's basically an OR query):

BooleanQuery:

the_quick OR
quick_brown OR
brown_fox OR
...

This should be the most performant approach, since it is then really just a BooleanQuery of TermQuery clauses.
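
The effect of that OR query can be simulated outside Solr with a toy inverted index. The following Python sketch is an illustration under stated assumptions, not Lucene code: `shingles`, `build_index` and `search` are hypothetical names, and tokenization is plain lowercasing plus whitespace splitting.

```python
from collections import defaultdict


def shingles(text, size=2, sep="_"):
    """Word bigrams joined by `sep` (hypothetical helper, not Lucene's filter)."""
    words = text.lower().split()
    return [sep.join(words[i:i + size]) for i in range(len(words) - size + 1)]


def build_index(docs):
    """Map each shingle term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in shingles(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """OR together one term lookup per query shingle (all SHOULD clauses)."""
    hits = set()
    for term in shingles(query):
        hits |= index.get(term, set())
    return hits


docs = {1: "quick brown", 2: "fox over", 3: "lazy dog"}
index = build_index(docs)
matches = search(index, "the quick brown fox jumps over the lazy dog")
print(sorted(docs[d] for d in matches))  # ['lazy dog', 'quick brown']
```

"fox over" is not returned because `fox_over` never appears among the query's bigrams. Note that in this sketch a document longer than two words would match as soon as any one of its bigrams occurs in the query, and a one-word document produces no bigrams at all.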
