High performance unique document id retrieval

Question

I am currently working on a high-performance NRT (near-real-time) system that detects near-duplicate text documents, using Lucene 4.9.0 on the Java platform.

For this purpose I query Lucene for a set of matching candidates and do the near-duplicate calculation locally (by retrieving and caching term vectors). My main concern is the performance cost of mapping Lucene's docId (which can change) to my own unique, immutable document id stored in the index.

My flow is as follows:

query for documents in Lucene
for each document:
- fetch my unique document id based on Lucene docId
- get the term vector for my document id from the cache (if it doesn't exist, fetch it from Lucene and populate the cache)
- do maths...
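
The flow above can be sketched roughly as follows. This is a minimal, Lucene-free illustration: `IdResolver` and `VectorLoader` are hypothetical stand-ins for the index lookups, and the class name and the term-to-frequency map used as a term vector are assumptions made for the example, not part of the original system.

```java
import java.util.HashMap;
import java.util.Map;

class NearDuplicateFlow {
    /** Hypothetical stand-in: maps a Lucene docId to the stable, application-level id. */
    interface IdResolver {
        int uniqueId(int luceneDocId);
    }

    /** Hypothetical stand-in: loads a term vector (term -> frequency) from the index. */
    interface VectorLoader {
        Map<String, Integer> load(int luceneDocId);
    }

    // Term vectors are cached under the stable id, never under the Lucene docId.
    private final Map<Integer, Map<String, Integer>> vectorCache = new HashMap<>();
    private final IdResolver ids;
    private final VectorLoader vectors;

    NearDuplicateFlow(IdResolver ids, VectorLoader vectors) {
        this.ids = ids;
        this.vectors = vectors;
    }

    /** Returns the term vector for a hit, touching the index only on a cache miss. */
    Map<String, Integer> vectorFor(int luceneDocId) {
        int uniqueId = ids.uniqueId(luceneDocId);            // step 1: docId -> unique id
        Map<String, Integer> vector = vectorCache.get(uniqueId);
        if (vector == null) {                                 // step 2: populate on miss
            vector = vectors.load(luceneDocId);
            vectorCache.put(uniqueId, vector);
        }
        return vector;                                        // step 3: caller does the maths
    }
}
```

Every hit passes through `uniqueId` even when its term vector is already cached, which is why the cost of that single lookup dominates the loop.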

My major bottleneck is the "fetch my unique document id" step, which degrades performance severely (especially since I sometimes have to process, say, 40000 term vectors in a single loop).

    try {
        Document document = indexReader.document(id);
        return document.getField(ID_FIELD_NAME).numericValue().intValue();
    } catch (IOException e) {
        throw new IndexException(e);
    }

The possible solutions I have considered are:

- trying Zoie, which handles unique and persistent doc identifiers,
- using FieldCache (still very inefficient),
- using payloads (as described in http://invertedindex.blogspot.com/2009/04/lucene-dociduid-mapping-and-payload.html), but I have no idea how to apply them.

Any other suggestions?

meaclum · Accepted Answer · 2014-07-22T09:01:03

I have figured out how to partially solve the issue by taking advantage of Lucene's AtomicReader. I use a global cache that keeps the FieldCache of each segment that has already been loaded.

    Map<Object, FieldCache.Ints> fieldCacheMap = new HashMap<Object, FieldCache.Ints>();

In my method I use the following code:

    // Lucene 4.x classes come from org.apache.lucene.index and org.apache.lucene.search
    Query query = new TermQuery(new Term(FIELD_NAME, fieldValue));
    // near-real-time reader; 'true' applies all pending deletes
    IndexReader indexReader = DirectoryReader.open(indexWriter, true);

    // cache keys of the segments visible to this reader
    Set<Object> usedReaderSet = new HashSet<Object>();

    List<AtomicReaderContext> leaves = indexReader.getContext().leaves();

    // process each segment separately
    for (AtomicReaderContext leaf : leaves) {
        AtomicReader reader = leaf.reader();

        FieldCache.Ints fieldCache;
        // the core cache key stays the same for a segment across reader reopens
        Object fieldCacheKey = reader.getCoreCacheKey();

        synchronized (fieldCacheMap) {
            fieldCache = fieldCacheMap.get(fieldCacheKey);
            if (fieldCache == null) {
                fieldCache = FieldCache.DEFAULT.getInts(reader, ID_FIELD_NAME, true);
                fieldCacheMap.put(fieldCacheKey, fieldCache);
            }
            usedReaderSet.add(fieldCacheKey);
        }

        // searching the segment reader directly yields segment-local docIDs,
        // which is what the per-segment FieldCache expects
        IndexSearcher searcher = new IndexSearcher(reader);
        TopDocs topDocs = searcher.search(query, Integer.MAX_VALUE);

        ScoreDoc[] scoreDocs = topDocs.scoreDocs;

        for (int i = 0; i < scoreDocs.length; i++) {
            int docID = scoreDocs[i].doc;
            int offerId = fieldCache.get(docID);
            // do your processing here
        }
    }

    // drop cache entries for segments that no longer exist
    synchronized (fieldCacheMap) {
        fieldCacheMap.keySet().retainAll(usedReaderSet);
    }

    indexReader.close();

It works pretty fast. My main concern is memory usage, which can be really high with a large index.
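
The per-segment lookup above relies on doc IDs being segment-local. If the search is instead run once on the top-level reader, each global doc ID has to be translated into a segment index plus a local doc ID before the per-segment cache can be consulted. Below is a minimal, Lucene-free sketch of that translation, assuming every segment is non-empty. The class `SegmentDocIdResolver` and the plain `int[]` arrays standing in for the cached id values are hypothetical. In Lucene 4.x the per-segment offset is exposed as `AtomicReaderContext.docBase`, and `ReaderUtil.subIndex` performs a comparable lookup.

```java
import java.util.Arrays;

class SegmentDocIdResolver {
    private final int[] docBases;        // first global doc ID of each segment, ascending
    private final int[][] idsBySegment;  // per-segment cache: local doc ID -> application id
    private final int totalDocs;

    /** Assumes every segment holds at least one document. */
    SegmentDocIdResolver(int[][] idsBySegment) {
        this.idsBySegment = idsBySegment;
        this.docBases = new int[idsBySegment.length];
        int base = 0;
        for (int i = 0; i < idsBySegment.length; i++) {
            docBases[i] = base;
            base += idsBySegment[i].length;
        }
        this.totalDocs = base;
    }

    /** Index of the segment that contains the given global doc ID. */
    int segmentOf(int globalDocId) {
        if (globalDocId < 0 || globalDocId >= totalDocs) {
            throw new IllegalArgumentException("doc ID out of range: " + globalDocId);
        }
        int pos = Arrays.binarySearch(docBases, globalDocId);
        // exact match: the doc is the first one of that segment;
        // otherwise binarySearch returns -(insertionPoint) - 1, and the
        // owning segment is the one just before the insertion point
        return pos >= 0 ? pos : -pos - 2;
    }

    /** Application-level id for a global doc ID. */
    int uniqueId(int globalDocId) {
        int segment = segmentOf(globalDocId);
        return idsBySegment[segment][globalDocId - docBases[segment]];
    }
}
```

With three segments of sizes 3, 2 and 1, global doc ID 4 falls in the second segment at local position 1. Whether a single top-level search is faster than one search per segment depends on the index and should be measured.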
