0
votes

I have pairs of search strings, and I want to use Lucene to search for sentences that contain all terms that are contained in these strings. So for example if I have the two search strings "white shark" and "fish", I want all sentences containing both "white", "shark" and "fish". Apparently, with Lucene this can be done rather easily by means of a boolean query; this is how I do it in my code:

String search =  str1+" "+ str2;
BooleanQuery booleanQuery = new BooleanQuery();
QueryParser queryParser = new QueryParser(...);
queryParser.setDefaultOperator(QueryParser.Operator.AND);
booleanQuery.add(queryParser.parse(search), BooleanClause.Occur.MUST);

However, I also have pairs of search strings where one string is a subpart of the other, such as e.g. "timber wolf" and "wolf", and in these cases I would like to only get sentences that contain "wolf" at least twice (and "timber" at least once). Is there any way to achieve this with Lucene? Many thanks in advance for your answers!

1

1 Answers

0
votes

Keep in mind, that a doc that has both "timber wolf" and a separate "wolf" will get scored higher all else being equal, since the term "wolf" occurs twice giving it a higher tf score. In most cases like this, the desired results being the first ones is acceptable, usually even desirable.

That said, I believe you could get what you want using a phrase query with slop, and having the slop set adequately high. Something like:

"timber wolf wolf"~10000

Which is probably high enough for most cases. This would require two instances of wolf and one of timber.

However, if you need timber wolf to appear (that is, those two terms adjacent and in order), you'll need to abandon the query parser, and construct the appropriate queries yourself. SpanQueries, specifically.

SpanQuery wolfQuery = new SpanTermQuery(new Term("myField", "wolf"));
SpanQuery[] timberWolfSubQueries = {
    new SpanTermQuery(new Term("myField", "timber")),
    new SpanTermQuery(new Term("myField", "wolf"))
};
//arguments "0, true" mean 0 slop and in order (respectively)
SpanQuery timberWolfQuery = new SpanNearQuery(timberWolfSubQueries, 0, true);
SpanQuery[] finalSubQueries = {
    wolfQuery, timberWolfQuery
};
//arguments "10000, false" mean 10000 slop and not (necessarily) in order
SpanQuery finalQuery = new SpanNearQuery(finalSubQueries, 10000, false);