1
votes

I have 3 records in a Lucene index.

Record 1 contains "healthcare" in the title field. Record 2 contains "healthcare" and "insurance" in the description field, but not together. Record 3 contains "healthcare insurance" in the company name field.

When a user searches for healthcare insurance, I want to show the records in the following order in the search results:

a. Record #3, because it contains both words of the input together (i.e. as a phrase)
b. Record #1
c. Record #2

To put it another way, exact match of all keywords should be given more weight than matches of individual keywords.

How do I achieve this in Lucene?

Thanks.

2
Record #1 doesn't contain the query term "insurance", yet you would like it to be ranked #2. Is that correct? - Shashikant Kore

2 Answers

1
votes

You can use phrase + slop as bajafresh4life says, but it will fail to match anything if the terms are further apart than the slop allows.

A slightly more complicated alternative is to construct a boolean query that explicitly searches for the phrase (with or without slop) and each of the terms in the phrase. E.g.

"healthcare insurance" OR healthcare OR insurance

Lucene's normal relevance sort will then give you what you want, and it won't fail in the way that the "big slop" approach will.
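To make the rewrite concrete, here is a minimal sketch that turns the raw user input into the query string shown above. It only builds a string in the classic query parser syntax; the class and method names (PhraseOrTermsQuery, build) are made up for illustration, and it assumes the input is plain whitespace-separated words with no query-syntax characters that would need escaping.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Lucene: builds
//   "term1 term2 ..." OR term1 OR term2 ...
// from whitespace-separated user input.
final class PhraseOrTermsQuery {

    static String build(String userInput) {
        String trimmed = userInput.trim();
        if (trimmed.isEmpty()) {
            return "";
        }
        String[] terms = trimmed.split("\\s+");
        if (terms.length == 1) {
            // A single word has no phrase form worth adding.
            return terms[0];
        }
        List<String> clauses = new ArrayList<>();
        // The exact phrase goes first; documents matching it also match
        // the individual terms, so they collect score from every clause.
        clauses.add("\"" + String.join(" ", terms) + "\"");
        for (String term : terms) {
            clauses.add(term);
        }
        return String.join(" OR ", clauses);
    }
}
```

The resulting string would then be handed to whatever query parser you already use for user queries.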

You can also boost individual fields so that, for example, title is weighted more heavily than description or company name. This needs an even more complicated query, but gives you a lot more control over the ordering...

title:"healthcare insurance"^2 OR title:healthcare^2 OR title:insurance^2
OR description:"healthcare insurance" OR ...

It can be quite tricky to get the weights right, and you may have to play around with them to get exactly what you want (e.g. in the example I just gave, you might not want to boost the individual terms for title), but when you get it working, it's pretty nice :-)
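Writing the per-field version by hand gets tedious, so here is a rough sketch of generating it. Again this only produces a query string in the classic parser syntax; FieldBoostedQuery and its build method are hypothetical names, the boosts are assumed to be whole numbers (a boost of 1 is simply left off), and the input is assumed to be non-empty plain words.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Lucene: for every field, emits the
// phrase clause and one clause per term, each with that field's boost.
final class FieldBoostedQuery {

    static String build(String userInput, Map<String, Integer> fieldBoosts) {
        String[] terms = userInput.trim().split("\\s+");
        String phrase = "\"" + String.join(" ", terms) + "\"";
        List<String> clauses = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : fieldBoosts.entrySet()) {
            String field = entry.getKey();
            // Omit the suffix for a boost of 1, since that is the default weight.
            String boost = entry.getValue() == 1 ? "" : "^" + entry.getValue();
            clauses.add(field + ":" + phrase + boost);
            for (String term : terms) {
                clauses.add(field + ":" + term + boost);
            }
        }
        return String.join(" OR ", clauses);
    }
}
```

Passing a LinkedHashMap keeps the fields in the order you added them, which makes the generated string easier to read while you tune the weights. If you decide not to boost the individual terms for a field, that is the inner loop to change.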

1
votes

Rewrite the query with a phrase + slop factor. So if the query is:

healthcare insurance

you can rewrite it as:

"healthcare insurance"~100

Documents that have the words "healthcare" and "insurance" closer in proximity to each other will be scored higher. In this case, since the slop factor is 100, documents that have both words but are more than 100 terms apart will not match.

Rewriting the query involves manipulating the Term objects in a BooleanQuery. Take all the terms, create a PhraseQuery, and set a slop factor.
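If you would rather not work with Term objects directly, the same rewrite can be done at the query-string level before parsing. The sketch below takes that simpler route, so it is a swapped-in alternative to the PhraseQuery construction described above rather than an implementation of it. SloppyPhraseQuery and build are hypothetical names, and the input is assumed to be plain words with no characters that need escaping.

```java
// Hypothetical helper, not part of Lucene: wraps the user's words in a
// quoted phrase and appends the slop, e.g. "healthcare insurance"~100.
final class SloppyPhraseQuery {

    static String build(String userInput, int slop) {
        String trimmed = userInput.trim();
        String[] terms = trimmed.split("\\s+");
        if (terms.length < 2) {
            // Slop is meaningless for a single word; return it unchanged.
            return trimmed;
        }
        return "\"" + String.join(" ", terms) + "\"~" + slop;
    }
}
```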