
I'm going to put a +100 bounty on this question when possible, even if it's already answered and accepted.

I'm using Lucene 3.2, here's what I have in my index and code:

  • More than 10 fields per indexed document.
  • The OR operator in the query phrase (i.e. "my lucene search" becomes "my OR lucene OR search").
  • MultiFieldQueryParser with Occur.SHOULD on all fields.
  • A specific default field containing the content of all other fields (as proposed in the answer to "How to do a Multi field - Phrase search in Lucene?").

What am I trying to achieve? A sort of Google-like search. Let me explain:

  • Search in all fields
  • Scored results (with boost for specific fields, etc.)
  • Adding words to the query phrase should narrow the results

I have achieved everything except this last point. My problems are the following:

  • If I search only in the default field containing all other fields, the results are not well scored.
  • Searching only with the AND operator filters too aggressively: I only get documents that contain the whole query phrase in a single field.
  • Searching only with the OR operator works perfectly with a single word, but when I add more words to the query phrase, the number of results grows significantly instead of shrinking (as it does on Google).
  • I don't know how to use one query to filter the results of another.

This is my current call to the query parser:

MultiFieldQueryParser.parse(
    Version.LUCENE_31,
    OrQueryWords, //query words separated with the OR operator
    searchFields, //String[] searchFields; // all fields
    occurs, //Occur[] occurs; {Occur.SHOULD, Occur.SHOULD, etc..}
    getFullTextSession().getSearchFactory().getAnalyzer(Product.class)
);

The toString() of this query prints something like this:

(field1:"word1 word2" (field1:word1 field1:word2)) (field2:"word1 word2" (...)) etc.

Right now I'm trying to add a second query on the default field (the one containing all other fields), with the query words separated by the AND operator and Occur.MUST:

MultiFieldQueryParser.parse(
    Version.LUCENE_31,
    AndQueryWords, //query words separated with the AND operator
    new String[] {"defaultField"},
    new Occur[] {Occur.MUST},
    getFullTextSession().getSearchFactory().getAnalyzer(Product.class)
);

The toString() of this query prints this:

+(default:"word1 word2" (+default:word1 +default:word2))

How can I intersect both queries? Is there any other way to achieve this?


2 Answers

1
votes

The approach I've used to solve a similar problem is based on limiting the number of results by score.

Unfortunately, Lucene doesn't provide such a feature out of the box, and its developers discourage this approach (http://wiki.apache.org/lucene-java/ScoresAsPercentages). The main concern is that a score's absolute value is meaningless.

I used the score's relative value for filtering: I picked the highest score, calculated the minimum accepted score from it (let's say maxScore / 5), and kept only those results whose score met this threshold.
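The relative-score cutoff described above could be sketched roughly as follows. This is plain Java with no Lucene dependency: the `Hit` class and the `keepRelativeTo` method are hypothetical names introduced for illustration, and the sketch assumes each hit's score has already been copied out of the search results (for example from the `score` field of a Lucene `ScoreDoc`).

```java
import java.util.ArrayList;
import java.util.List;

final class RelativeScoreFilter {

    /** Hypothetical holder for one search hit: a document id and its score. */
    static final class Hit {
        final int docId;
        final float score;

        Hit(int docId, float score) {
            this.docId = docId;
            this.score = score;
        }
    }

    /**
     * Keeps only the hits whose score is at least maxScore / divisor,
     * where maxScore is the highest score among the given hits.
     * The input does not need to be sorted; input order is preserved.
     */
    static List<Hit> keepRelativeTo(List<Hit> hits, float divisor) {
        float maxScore = 0f;
        for (Hit hit : hits) {
            maxScore = Math.max(maxScore, hit.score);
        }
        float minAccepted = maxScore / divisor;

        List<Hit> kept = new ArrayList<Hit>();
        for (Hit hit : hits) {
            if (hit.score >= minAccepted) {
                kept.add(hit);
            }
        }
        return kept;
    }
}
```

With a divisor of 5, a result set whose best score is 5.0 would keep every hit scoring 1.0 or more. The divisor is a tuning knob, not a value with any special meaning, so it probably needs to be adjusted against real queries.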

1
votes

I am not sure I understand exactly what you want to achieve, so I am going to give you a few hints on how to customize your scoring when dealing with multi-field, multi-term queries.

Intersection of two queries

You seem to be happy with the result set of your conjunctive query on the default field, and with the scoring of your disjunctive query on all fields. You can get the best of both worlds by using the latter as your main query and the former as a filter.

For example:

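import org.apache.lucene.search.BooleanClause.Occur;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;

Query mainQuery, filterQuery; // built beforehand, e.g. with MultiFieldQueryParser

BooleanQuery query = new BooleanQuery();

// add the main query for scoring
query.add(mainQuery, Occur.SHOULD);

// keep the filter query from contributing to the score
filterQuery.setBoost(0f);
// make the filter query required
query.add(filterQuery, Occur.MUST);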

Minimum should match clauses

If AND-ing all clauses is too restrictive, and OR-ing all clauses is not restrictive enough, then you could do something in between by setting the minimum number of SHOULD clauses that must match for a document to appear in the result set.

The difficult part is then to find the right formula for computing the minimum number of SHOULD clauses that must match for an optimal user experience.

For example, let's say you want the ceiling of 3/4 of the SHOULD clauses to match. Starting with a two-clause query and adding clauses up to five would yield the following evolution of the number of results:

  • 2 terms => ceil(2 * 3 / 4) = 2: 2/2 clauses must match (all clauses are required)
  • 3 terms => ceil(3 * 3 / 4) = 3: 3/3 clauses must match (the new clause is required, fewer results)
  • 4 terms => ceil(4 * 3 / 4) = 3: 3/4 clauses must match (one of the clauses is optional, more results)
  • 5 terms => ceil(5 * 3 / 4) = 4: 4/5 clauses must match (maybe more, maybe fewer results, depending on how often the new term co-occurs with the first four)

In any case, with this feature, the only way to guarantee that the number of results shrinks as the number of clauses increases is to use a purely conjunctive query.
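The 3/4 rule from the example can be computed with integer arithmetic, as in the minimal sketch below. The class and method names are hypothetical. The sketch assumes the resulting value would then be passed to `BooleanQuery.setMinimumNumberShouldMatch(int)` on a query whose SHOULD clauses correspond one-to-one to the query terms, which is not necessarily how the parsed multi-field query in the question is structured.

```java
final class MinShouldMatch {

    /**
     * Returns ceil(clauseCount * numerator / denominator) using only
     * integer arithmetic, e.g. numerator = 3 and denominator = 4 for
     * "three quarters of the clauses, rounded up".
     */
    static int minimumShouldMatch(int clauseCount, int numerator, int denominator) {
        return (clauseCount * numerator + denominator - 1) / denominator;
    }

    public static void main(String[] args) {
        // Prints how many clauses are required for queries of 2 to 5 terms.
        for (int terms = 2; terms <= 5; terms++) {
            int required = minimumShouldMatch(terms, 3, 4);
            System.out.println(terms + " terms => " + required + " must match");
        }
    }
}
```

Going from 3 to 4 terms leaves the required count at 3, which is why the result set can grow at that step.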