There are a few things you can do to speed up the search.
First, if you don't use scoring, disable norms; this makes the index smaller.
Since you only use StringField and LongField (as opposed to, say, a TextField with a keyword tokenizer), norms are disabled for these field types, so you've already got that one covered.
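Should you ever add an analyzed field yourself, you can still switch norms off explicitly. A minimal sketch, assuming Lucene 5.x's FieldType API; the "description" field and textContent variable are made up for illustration:
import org.apache.lucene.document.Field;
import org.apache.lucene.document.FieldType;
import org.apache.lucene.document.TextField;

// hypothetical: an analyzed field with norms switched off
final FieldType noNorms = new FieldType(TextField.TYPE_NOT_STORED);
noNorms.setOmitNorms(true);
noNorms.freeze();
document.add(new Field("description", textContent, noNorms));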
Second, structure and wrap your query so that you minimize the calculation of actual scores. That is, if you use BooleanQuery, use Occur.FILTER instead of Occur.MUST. Both have the same inclusion logic, but FILTER clauses don't score. For other queries, consider wrapping them in a ConstantScoreQuery. However, this might not be necessary at all (explanation follows).
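For illustration, a minimal sketch of both variants. The field names are made up, and this assumes the BooleanQuery.Builder API of recent Lucene 5.x versions; on older versions you would add the clauses to a BooleanQuery directly:
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause.Occur;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.ConstantScoreQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

// both clauses must match, but FILTER clauses don't contribute to the score
final Query filtered = new BooleanQuery.Builder()
        .add(new TermQuery(new Term("type", "order")), Occur.FILTER)
        .add(new TermQuery(new Term("status", "open")), Occur.FILTER)
        .build();

// a standalone query can be stripped of scoring by wrapping it;
// every matching document gets the same constant score
final Query unscored = new ConstantScoreQuery(new TermQuery(new Term("type", "order")));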
Third, use a custom Collector. The default search method is meant for small, ranked or sorted result sets, but your use case doesn't fit that pattern. Here is a sample implementation:
import org.apache.lucene.document.Document;
import org.apache.lucene.index.LeafReader;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.search.SimpleCollector;

import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class AllDocumentsCollector extends SimpleCollector {

    private final List<Document> documents;
    private LeafReader currentReader;

    public AllDocumentsCollector(final int numDocs) {
        // pre-size the list to avoid re-allocations while collecting
        this.documents = new ArrayList<>(numDocs);
    }

    public List<Document> getDocuments() {
        return Collections.unmodifiableList(documents);
    }

    @Override
    protected void doSetNextReader(final LeafReaderContext context) {
        // called once per index segment; remember its reader to load documents
        currentReader = context.reader();
    }

    @Override
    public void collect(final int doc) throws IOException {
        // doc is the segment-local document id, valid for the current leaf reader
        documents.add(currentReader.document(doc));
    }

    @Override
    public boolean needsScores() {
        // no scores needed, which lets Lucene skip the scoring machinery
        return false;
    }
}
You would use it like this:
public List<Document> performLuceneSearch(final Query query) throws IOException {
    // the reader instance is reused as often as possible, and exchanged
    // when a write occurs using DirectoryReader.openIfChanged(...)
    final AllDocumentsCollector collector = new AllDocumentsCollector(this.reader.numDocs());
    this.searcher.search(query, collector);
    return collector.getDocuments();
}
The collector uses a list instead of a set. Document does not implement equals or hashCode, so you don't profit from a set and only pay for additional equality checks. The final order is the so-called index order: the first document will be the one that comes first in the index (roughly insertion order if you don't have custom merge strategies in place, but ultimately an arbitrary order that is not guaranteed to be stable or reliable). Also, the collector signals that no scores are needed, which gives you about the same benefits as option 2 above, so you can save yourself some trouble and just leave your queries as they are right now.
Depending on what you need the documents for, you can get an even greater speedup by using DocValues instead of stored fields. This only holds if you require just one or two of your fields, not all of them. The rule of thumb is: for few documents but many fields, use stored fields; for many documents but few fields, use DocValues. In any case you should experiment; 8 fields is not that much, and you might profit even when loading all of them. Here is how you would use DocValues in your indexing process:
import org.apache.lucene.document.Field;
import org.apache.lucene.document.LongField;
import org.apache.lucene.document.NumericDocValuesField;
import org.apache.lucene.document.SortedDocValuesField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.util.BytesRef;

// index the string for searching, plus a DocValues field for fast retrieval
document.add(new StringField(fieldName, stringContent, Field.Store.NO));
document.add(new SortedDocValuesField(fieldName, new BytesRef(stringContent)));
// OR, for the long fields:
document.add(new LongField(fieldName, longValue, Field.Store.NO));
document.add(new NumericDocValuesField(fieldName, longValue));
The field name can be the same, and you can choose not to store your other fields if you can rely entirely on DocValues.
The collector then has to be changed; here is an example for one field:
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.index.SortedDocValues;
import org.apache.lucene.search.SimpleCollector;

import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class AllDocumentsCollector extends SimpleCollector {

    private final List<String> documents;
    private final String fieldName;
    private SortedDocValues docValues;

    public AllDocumentsCollector(final String fieldName, final int numDocs) {
        this.fieldName = fieldName;
        this.documents = new ArrayList<>(numDocs);
    }

    public List<String> getDocuments() {
        return Collections.unmodifiableList(documents);
    }

    @Override
    protected void doSetNextReader(final LeafReaderContext context) throws IOException {
        // fetch the per-segment DocValues for the field
        docValues = context.reader().getSortedDocValues(fieldName);
    }

    @Override
    public void collect(final int doc) throws IOException {
        // read the value straight from the columnar DocValues storage,
        // without loading the whole stored document
        documents.add(docValues.get(doc).utf8ToString());
    }

    @Override
    public boolean needsScores() {
        return false;
    }
}
You would use getNumericDocValues for the long fields, respectively (see the sketch below). You have to repeat this (in the same collector, of course) for all the fields you have to load, and most importantly: measure whether it's better to load full documents from the stored fields instead of using DocValues.
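As a sketch of the numeric variant (the class name is made up; everything else follows the same pattern as the collector above):
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.index.NumericDocValues;
import org.apache.lucene.search.SimpleCollector;

import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class AllLongValuesCollector extends SimpleCollector {

    private final List<Long> values;
    private final String fieldName;
    private NumericDocValues docValues;

    public AllLongValuesCollector(final String fieldName, final int numDocs) {
        this.fieldName = fieldName;
        this.values = new ArrayList<>(numDocs);
    }

    public List<Long> getValues() {
        return Collections.unmodifiableList(values);
    }

    @Override
    protected void doSetNextReader(final LeafReaderContext context) throws IOException {
        // fetch the per-segment numeric DocValues for the field
        docValues = context.reader().getNumericDocValues(fieldName);
    }

    @Override
    public void collect(final int doc) throws IOException {
        values.add(docValues.get(doc));
    }

    @Override
    public boolean needsScores() {
        return false;
    }
}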
One final note:
"I am doing locking on the application level, so Lucene won't have to worry about concurrent reads and writes."
The IndexSearcher and IndexWriter themselves are already thread-safe. If you lock solely for Lucene's sake, you can remove those locks and just share both instances amongst all your threads. And consider using oal.search.SearcherManager for reusing the IndexReader/IndexSearcher.
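A minimal sketch of how the pieces could fit together, assuming you already have an IndexWriter and reusing the AllDocumentsCollector from above (the SearchService class is made up for illustration):
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.SearcherManager;

import java.io.IOException;
import java.util.List;

final class SearchService {

    private final SearcherManager manager;

    // true = apply all deletes before a new searcher becomes visible,
    // null = use the default SearcherFactory
    SearchService(final IndexWriter writer) throws IOException {
        this.manager = new SearcherManager(writer, true, null);
    }

    // call this after commits so new searches see the changes
    void refresh() throws IOException {
        manager.maybeRefresh();
    }

    List<Document> search(final Query query) throws IOException {
        final IndexSearcher searcher = manager.acquire();
        try {
            final AllDocumentsCollector collector =
                    new AllDocumentsCollector(searcher.getIndexReader().numDocs());
            searcher.search(query, collector);
            return collector.getDocuments();
        } finally {
            // always release the searcher so the manager can close old readers
            manager.release(searcher);
        }
    }
}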