3
votes

I have a fairly large Lucene index, and queries that can hit about 5000 documents or so. I am storing my application metadata in a Lucene field (apart from the text contents), and need to quickly get at this small metadata field for all 5000 hits. Currently, my code looks something like this:

FieldSelector selector = new MapFieldSelector("metaData");
ScoreDoc[] hits = searcher.search(query, null, 10000).scoreDocs;
for (int i = 0; i < hits.length; i++) {
    int indexDocId = hits[i].doc;
    Document hitDoc = searcher.doc(indexDocId, selector); // expensive, especially with a disk-based index
    String metadata = hitDoc.getFieldable("metaData").stringValue();
}

However, this is terribly slow because each call to searcher.doc() is pretty expensive. Is there a way to do a "batch" fetch of the field for all the hits that would be more responsive? Or any other way to make this faster? (The only thing inside a ScoreDoc appears to be the Lucene doc id, which I understand should not be relied upon; otherwise I would have maintained my own Lucene doc id -> metadata map.) Thanks!

Update: I am now trying to use a FieldCache like this:

String[] metadatas = org.apache.lucene.search.FieldCache.DEFAULT.getStrings(searcher.getIndexReader(), "metaData");

when I open the index, and upon a query:

int ldocId = hits[i].doc;
String metadata = metadatas[ldocId];

This is working well for me.

1
Hi, I'm having a similar issue, but FieldCache.DEFAULT.getStrings is not available anymore in Lucene 4.5.1. Do you know of a similar method? – ikel
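
In Lucene 4.x, the FieldCache API became per-segment and the getStrings method was removed. A rough sketch of the 4.x-era equivalent is below; treat it as a starting point rather than a definitive answer, since exact method signatures shifted between 4.x point releases, and SlowCompositeReaderWrapper (which un-inverts across all segments at once) carries a performance cost:

```java
// Lucene 4.x sketch (signatures vary between 4.x releases).
// Field caches are per-segment in 4.x, so either iterate the index's
// leaves or wrap the composite reader into a single AtomicReader:
AtomicReader reader = SlowCompositeReaderWrapper.wrap(searcher.getIndexReader());
BinaryDocValues metadatas = FieldCache.DEFAULT.getTerms(reader, "metaData", false);

BytesRef scratch = new BytesRef();
metadatas.get(hits[i].doc, scratch); // fills scratch with the field's bytes
String metadata = scratch.utf8ToString();
```

If you control the indexing side, indexing the field as doc values (e.g. a BinaryDocValuesField or SortedDocValuesField) avoids the FieldCache un-inversion entirely and is the direction Lucene moved in after 4.x.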

1 Answer

1
votes

Your best bet for improving performance is to reduce the stored data as much as you can. If you have a large content field stored in the index, setting it to be indexed only, rather than stored, will improve your performance. Storing content external to Lucene, to be fetched after a hit is found in the index, is often a better idea.
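
Using the 3.x-style API from your question, that distinction looks roughly like this (a sketch; the field names are from your example and the content field is assumed):

```java
// Large text: index for searching, but don't store it in the index.
doc.add(new Field("contents", text, Field.Store.NO, Field.Index.ANALYZED));
// Small metadata: store it so it can be retrieved cheaply per hit.
doc.add(new Field("metaData", metadata, Field.Store.YES, Field.Index.NOT_ANALYZED));
```

The smaller the stored portion of each document, the cheaper every searcher.doc() call becomes.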

There is also the possiblity that there exists a better way to get to the end result you are looking for. I'm guessing that the 5000 sets of metadata aren't the end result here. Your analysis may be handled more easily on indexed data in Lucene, instead of by pulling it all out of the index first. No idea, based on what you've provided, whether this is possible in your case, but could certainly be worth a look.