I am looking for a way to filter a Lucene index on multiple conditions. I tried two different ways of filtering the search, and neither of them works for me:

Using BooleanQuery:

    BooleanQuery query = new BooleanQuery();
    String lower = "*";
    String upper = "*";
    for (String fieldName : keywordSourceFields) {
      TermRangeQuery rangeQuery = TermRangeQuery.newStringRange(fieldName,
          lower, upper, true, true);
      query.add(rangeQuery, Occur.MUST);
    }
    TermRangeQuery rangeQuery = TermRangeQuery.newStringRange(keywordField,
        lower, upper, true, true);
    query.add(rangeQuery, Occur.MUST_NOT);
    try {
      TopDocs results = searcher.search(query, null,
          maxNumDocs);

Using BooleanFilter:

    BooleanFilter filter = new BooleanFilter();
    String lower = "*";
    String upper = "*";
    for (String fieldName : keywordSourceFields) {
      TermRangeFilter rangeFilter = TermRangeFilter.newStringRange(fieldName,
          lower, upper, true, true);
      filter.add(rangeFilter, Occur.MUST_NOT);
    }
    TermRangeFilter rangeFilter = TermRangeFilter.newStringRange(keywordField,
        lower, upper, true, true);
    filter.add(rangeFilter, Occur.MUST);
    try {
      TopDocs results = searcher.search(new MatchAllDocsQuery(), filter,
          maxNumDocs);

I was wondering which part of these queries is wrong. I am looking for documents in which every field in keywordSourceFields has some value AND the keyword field has no value. Please guide me in correcting the query.

Best regards.

1 Answer

Firstly, it would be a much better idea to index a default value for empty fields. Each subquery you are combining here has to enumerate all the available values for the field just to determine that none exist. That will likely be very slow.

Passing a * in as a query term is not a valid way to construct an open-ended range query: the range endpoints are taken as literal terms, not wildcard patterns. null is the correct value to pass in for an open end. Passing in a null as a lower query term and includeLower = true will result in an exception (since it doesn't make sense).
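To see why "*" on both ends does not mean "any value", here is a minimal plain-Java sketch of a lexicographic term range. It is not Lucene code: TermRangeSketch and matchingTerms are hypothetical names, the field's terms are modelled as a sorted set of strings, and String.compareTo stands in for Lucene's term ordering (which I'm assuming agrees for plain ASCII terms). It also does not reproduce any of Lucene's argument checks; a null bound is simply treated as an open end.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

// Plain-Java model of a term range over one field's sorted terms.
// NOT Lucene code; it only mimics lexicographic range matching.
class TermRangeSketch {

    // Returns the terms between lower and upper; a null bound means "open end".
    static List<String> matchingTerms(TreeSet<String> terms, String lower, String upper,
                                      boolean includeLower, boolean includeUpper) {
        List<String> hits = new ArrayList<>();
        for (String term : terms) {
            if (lower != null) {
                int c = term.compareTo(lower);
                // Skip terms below the lower bound, or equal to it when exclusive.
                if (c < 0 || (c == 0 && !includeLower)) continue;
            }
            if (upper != null) {
                int c = term.compareTo(upper);
                // Skip terms above the upper bound, or equal to it when exclusive.
                if (c > 0 || (c == 0 && !includeUpper)) continue;
            }
            hits.add(term);
        }
        return hits;
    }

    public static void main(String[] args) {
        TreeSet<String> terms =
            new TreeSet<>(Arrays.asList("*", "apple", "lucene", "zebra"));
        // "*" as both bounds only matches the literal term "*".
        System.out.println(matchingTerms(terms, "*", "*", true, true));   // [*]
        // A real lower bound with an open (null) upper end matches the rest.
        System.out.println(matchingTerms(terms, "a", null, true, false)); // [apple, lucene, zebra]
    }
}
```

In this model a document whose field holds "apple" is never matched by the range from "*" to "*", which is consistent with both of your attempts returning nothing useful.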

Also, TermRangeQuery does not allow both ends to be null, and will throw an exception for that. As such, at least one end of the query must be a defined term. You'll need to come up with either a safe upper bound or lower bound to use.

So, you can do something like:

Query subQuery = TermRangeQuery.newStringRange("myField", "aaaaaaaaa", null, true, false);

Or, using a filter, you could have:

Filter subFilter = TermRangeFilter.More("myField", new BytesRef("aaaaaaaaa"));

This is a bit hacky, of course, and again, performance will be awful. You can mitigate that using caching filters, but indexing your data with a default value to search for in the case of an empty field is really what you should be doing. Lucene is most useful and performant when you index your data in a way that supports the kinds of search you want to do.
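As a rough illustration of the default-value idea, here is a small plain-Java sketch. It is not Lucene code: DefaultValueSketch, the "__EMPTY__" sentinel, and the term-to-document-ids map are all assumptions made up for the example, with the map standing in for a single indexed field. The point is only that once a sentinel is written at index time, "field has no value" becomes one exact-term lookup instead of a scan over every term.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Plain-Java model of "index a default value for empty fields".
// NOT Lucene code: a term -> document-id map stands in for one indexed field.
class DefaultValueSketch {

    // Hypothetical sentinel; choose a token that can never occur as real data.
    static final String EMPTY = "__EMPTY__";

    private final Map<String, Set<Integer>> postings = new HashMap<>();

    // Index one document's value, substituting the sentinel when it is absent.
    void index(int docId, String value) {
        String term = (value == null || value.isEmpty()) ? EMPTY : value;
        postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
    }

    // "Field has no value" is now a single exact-term lookup.
    Set<Integer> docsWithoutValue() {
        return postings.getOrDefault(EMPTY, new TreeSet<>());
    }

    public static void main(String[] args) {
        DefaultValueSketch keywordField = new DefaultValueSketch();
        keywordField.index(1, "java");
        keywordField.index(2, null);
        keywordField.index(3, "");
        System.out.println(keywordField.docsWithoutValue()); // prints [2, 3]
    }
}
```

In Lucene itself the equivalent would presumably be to add the sentinel as an untokenized field value when the real value is missing, and then match it with a TermQuery used as the MUST clause (or exclude it with MUST_NOT for the fields that are required to have a value).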