8
votes

To allow users to search across multiple fields with Lucene 3.5 I currently create and add a QueryParser for each field to be searched to a DisjunctionMaxQuery. This works great when using OR as the default operator but I now want to change the default operator to AND to get more accurate (and fewer) results.

The problem is that queryParser.setDefaultOperator(QueryParser.AND_OPERATOR) misses many documents, since all terms must then appear together in at least one field.

For example, consider the following data for a document: title field = "Programming Languages", body field = "Java, C++, PHP". If a user were to search for Java Programming, this particular document would not be included in the results, since neither the title nor the body field contains all terms in the query, although combined they do. I would want this document returned for the above query but not for the query HTML Programming.
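The two matching rules in this example can be sketched without Lucene at all. The snippet below is only an illustration under simplifying assumptions: the class and method names are hypothetical, a document is modeled as a map of field name to text, and tokenization is a crude lowercase split on whitespace and commas rather than a real analyzer.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration class; it does not use or mimic Lucene internals.
class FieldMatchSketch {

    // Crude stand-in for an analyzer: lowercase, split on whitespace and commas.
    static Set<String> tokens(String text) {
        return new HashSet<String>(Arrays.asList(text.toLowerCase().split("[\\s,]+")));
    }

    // Per-field AND: some single field must contain every query term.
    static boolean allTermsInOneField(Map<String, String> doc, String query) {
        Set<String> terms = tokens(query);
        for (String value : doc.values()) {
            if (tokens(value).containsAll(terms)) {
                return true;
            }
        }
        return false;
    }

    // Cross-field AND: every query term must appear in at least one field.
    static boolean eachTermInSomeField(Map<String, String> doc, String query) {
        Set<String> all = new HashSet<String>();
        for (String value : doc.values()) {
            all.addAll(tokens(value));
        }
        return all.containsAll(tokens(query));
    }

    // The example document from the question.
    static Map<String, String> exampleDoc() {
        Map<String, String> doc = new LinkedHashMap<String, String>();
        doc.put("title", "Programming Languages");
        doc.put("body", "Java, C++, PHP");
        return doc;
    }

    public static void main(String[] args) {
        Map<String, String> doc = exampleDoc();
        System.out.println(allTermsInOneField(doc, "Java Programming"));
        System.out.println(eachTermInSomeField(doc, "Java Programming"));
        System.out.println(eachTermInSomeField(doc, "HTML Programming"));
    }
}
```

The behavior I am after corresponds to eachTermInSomeField; the per-field parsers with AND as the default operator behave like allTermsInOneField.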

I've considered a catchall field, but I have a few problems with it. First, users frequently include per-field terms in their queries (author:bill), which is not possible with a catchall field alone. Also, I highlight certain fields with FastVectorHighlighter, which requires them to be indexed and stored. So by adding a catchall field I would have to index most of the same data twice, which costs both time and space.

Any ideas?

3
Regarding indexing a catchall field, have you observed a time/space hit that is cause for concern? My experience has been that indexing the same data in a specific stored field, and then adding it to a generalized index-only field, has a pretty minimal impact on performance or index size. - femtoRgon
Also, I wonder what the end query's structure looks like. Particularly, how the dis-max queries are set up. Easy to kill your ability to get meaningful scores with them. - femtoRgon
@femtoRgon The DisjunctionMaxQuery structure is like this: '((title:java title:programming) | (body:java body:programming))~0.2'. You bring up a good point that adding a catchall field may have little impact as far as time/space is concerned. I definitely considered it but decided against it, as I would also like to keep the ability to search by field, such as author:bill. Not only do users use this feature, but I use it behind the scenes. Thx. - Chris Davi

3 Answers

7
votes

Guess I should have done a little more research. Turns out MultiFieldQueryParser provides the exact functionality I was looking for. For whatever reason I was creating a QueryParser for each field I wanted to search like this:

String[] fields = {"title", "body", "subject", "author"};
QueryParser[] parsers = new QueryParser[fields.length];      
for(int i = 0; i < parsers.length; i++)
{
   parsers[i] = new QueryParser(Version.LUCENE_35, fields[i], analyzer);
   parsers[i].setDefaultOperator(QueryParser.AND_OPERATOR);
}

This would result in a query like this:

(+title:java +title:programming) | (+body:java +body:programming)

...which is not what I was looking for. Now I create a single MultiFieldQueryParser like this:

MultiFieldQueryParser parser = new MultiFieldQueryParser(Version.LUCENE_35, new String[]{"title", "body", "subject"}, analyzer);
parser.setDefaultOperator(QueryParser.AND_OPERATOR);

This gives me the query I was looking for:

+(title:java body:java) +(title:programming body:programming)
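To make the shape of that query easier to reason about for longer inputs, here is a small sketch that rebuilds the same string with plain string handling. It is an illustration only: CrossFieldQueryShape and build are hypothetical names, terms are lowercased and split on whitespace instead of being analyzed, and nothing here calls Lucene or claims to reproduce the parser's output for every kind of input.

```java
// Hypothetical helper that only mirrors the query shape shown above.
class CrossFieldQueryShape {

    // Builds one required group per term; each group lists the term in every field.
    static String build(String[] fields, String userInput) {
        StringBuilder query = new StringBuilder();
        for (String term : userInput.trim().toLowerCase().split("\\s+")) {
            if (term.length() == 0) {
                continue; // empty input yields an empty query string
            }
            if (query.length() > 0) {
                query.append(' ');
            }
            query.append("+(");
            for (int i = 0; i < fields.length; i++) {
                if (i > 0) {
                    query.append(' ');
                }
                query.append(fields[i]).append(':').append(term);
            }
            query.append(')');
        }
        return query.toString();
    }

    public static void main(String[] args) {
        System.out.println(build(new String[] {"title", "body"}, "Java Programming"));
    }
}
```

Each term becomes its own required clause, and within that clause any field may satisfy it, which is why the "Programming Languages" document now matches Java Programming.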

Thanks to @seeta and @femtoRgon for the help!

2
votes

Perhaps what you need is a combination of Boolean queries that capture the different combinations of fields and terms. In your given example, the query could be -

(title:Java AND body:programming) OR (title:programming AND body:Java).

I don't know if there's an existing Query class that generates this automatically for you, but I think that's what should be the ultimate query that's run on the index.
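As a rough sketch of what generating those combinations could look like, the snippet below enumerates every assignment of a field to each term. The names are hypothetical, the output is just query text rather than Lucene Query objects, and it also emits the same-field combinations (both terms in title, both in body), which the two-clause example above leaves out.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; produces query text only, not Lucene Query objects.
class FieldTermCombinations {

    // Returns one AND-group per way of assigning a field to every term.
    static List<String> combinations(String[] fields, String[] terms) {
        List<String> result = new ArrayList<String>();
        if (terms.length == 0 || fields.length == 0) {
            return result;
        }
        List<String> partial = new ArrayList<String>();
        partial.add("");
        for (String term : terms) {
            List<String> next = new ArrayList<String>();
            for (String prefix : partial) {
                for (String field : fields) {
                    String clause = field + ":" + term;
                    next.add(prefix.length() == 0 ? clause : prefix + " AND " + clause);
                }
            }
            partial = next;
        }
        for (String group : partial) {
            result.add("(" + group + ")");
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> groups = combinations(
                new String[] {"title", "body"}, new String[] {"java", "programming"});
        for (String group : groups) {
            System.out.println(group); // the groups would be joined with OR
        }
    }
}
```

The number of groups is the number of fields raised to the number of terms, so this expansion grows quickly. By distributivity, the OR of all these groups is logically equivalent to requiring each term in at least one field.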

0
votes

If you want to be able to search multiple fields with the same set of terms, then the query from your comment:

((title:java title:programming) | (body:java body:programming))~0.2

may not be the best implementation.

You're effectively getting either the score from the title, or the score from the body for the combined set of terms. The case where you hit java in the title and programming in the body would be given approx. equal weight to a hit on java in the body and no hit on programming.

I think a better structured query would be:

(title:java body:java)~0.2 (title:programming body:programming)~0.2

This makes more sense to me, since you want the dismax queries to limit score growth for multiple matches of the same term (in different fields), but you do want the score to grow for hits on different terms, I believe.
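A toy calculation may make the difference between the two structures concrete. It rests on loud assumptions: every matching term contributes a score of 1.0, a non-match contributes 0, and coord, idf and norms are ignored. The only rule taken from Lucene is the dismax combination itself: the maximum sub-score plus the tie-breaker times the sum of the remaining sub-scores. The class and method names are hypothetical.

```java
// Hypothetical toy model; real Lucene scores also involve coord, idf and norms.
class DisMaxScoreSketch {

    // Dismax combination: best sub-score plus tie * (sum of the other sub-scores).
    static double disMax(double tie, double... scores) {
        double max = 0.0;
        double sum = 0.0;
        for (double s : scores) {
            max = Math.max(max, s);
            sum += s;
        }
        return max + tie * (sum - max);
    }

    // ((title:java title:programming) | (body:java body:programming))~tie
    static double groupedByField(double tie, double titleJava, double titleProg,
                                 double bodyJava, double bodyProg) {
        return disMax(tie, titleJava + titleProg, bodyJava + bodyProg);
    }

    // (title:java | body:java)~tie (title:programming | body:programming)~tie
    static double groupedByTerm(double tie, double titleJava, double titleProg,
                                double bodyJava, double bodyProg) {
        return disMax(tie, titleJava, bodyJava) + disMax(tie, titleProg, bodyProg);
    }

    public static void main(String[] args) {
        // Doc A: java in title, programming in body. Doc B: java in body only.
        System.out.println("by field: A=" + groupedByField(0.2, 1, 0, 0, 1)
                + " B=" + groupedByField(0.2, 0, 0, 1, 0));
        System.out.println("by term:  A=" + groupedByTerm(0.2, 1, 0, 0, 1)
                + " B=" + groupedByTerm(0.2, 0, 0, 1, 0));
    }
}
```

Under these assumptions the per-field structure scores the two-term document only slightly above the one-term document (about 1.2 versus 1.0), while the per-term structure separates them clearly (2.0 versus 1.0).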

If that sort of query structure gets you better score results, limiting results to a certain minimum score (a percentage of the max score returned, rather than a simple hard-coded value) may be adequate to prevent too-weak results from being seen.
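A relative cutoff like that could be sketched as follows. The scores are passed in as a plain array here; with Lucene they would come from the hits of a search, which this sketch deliberately leaves out. RelativeScoreCutoff, keepAbove and the fraction parameter are hypothetical names, and the right fraction would have to be found by experiment.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; operates on raw scores rather than Lucene hit objects.
class RelativeScoreCutoff {

    // Keeps the scores that reach at least `fraction` of the highest score.
    static List<Float> keepAbove(float[] scores, float fraction) {
        List<Float> kept = new ArrayList<Float>();
        if (scores.length == 0) {
            return kept;
        }
        float max = scores[0];
        for (float s : scores) {
            max = Math.max(max, s);
        }
        float cutoff = max * fraction;
        for (float s : scores) {
            if (s >= cutoff) {
                kept.add(s);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        float[] scores = {2.0f, 1.2f, 1.0f, 0.3f};
        System.out.println(keepAbove(scores, 0.5f));
    }
}
```

Because the cutoff is relative to the best hit, it adapts to queries whose absolute scores differ widely, which a hard-coded minimum score would not.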


I also still wouldn't rule out indexing a catchall field. It's an implementation I've used before, indexing BOTH the specific fields and the catchall field, thus allowing both general querying and specific single-field queries. Index storage tends to be pretty lean for unstored terms, and it will generally help performance if you find yourself having to create big, complicated queries to make up for not having it.

If you really want to be sure that it takes minimal storage, you can even turn off TermVectors for that field:

new Field(name, value, Field.Store.NO, Field.Index.ANALYZED, Field.TermVector.NO);

Although I don't know how much of a difference that would really make.