
I have written a program to index database data to disk, and I am not sure whether my indexing speed is reasonable, i.e. whether it is slow and whether it can be improved further.

The speed I get is around 15,000 documents per hour, which amounts to around 2,600 KB of index directory size when creating new indices.

I am using Lucene 6.0.0 on Windows 8.1 64-bit, with 16 GB RAM and an 8-core Intel Core i7. I am indexing on my local machine and am not sure what kind of disk it has; it's the usual one that comes with a Windows PC.

I am using Spring Batch to INNER JOIN two database tables, get a row-mapped object from the ItemReader, and then prepare a Document from that object.

I always use writer.updateDocument(contentDuplicateKeyTerm, doc) rather than addDocument(doc), since in Lucene 6.0.0 updateDocument adds the document to the index if it does not already exist, in addition to updating an existing one.
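
For clarity, here is a minimal sketch of that upsert-style call. The field names ("contentKey", "body") and the index path are just placeholders, not my actual code:

    import java.nio.file.Paths;

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field.Store;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.store.FSDirectory;

    public class UpsertSketch {
        public static void main(String[] args) throws Exception {
            IndexWriter writer = new IndexWriter(
                    FSDirectory.open(Paths.get("/path/to/index")),
                    new IndexWriterConfig(new StandardAnalyzer()));

            Document doc = new Document();
            // The duplicate key must be a non-tokenized field so the Term below matches it exactly.
            doc.add(new StringField("contentKey", "row-42", Store.YES));
            doc.add(new TextField("body", "text prepared from the joined DB rows", Store.NO));

            // Deletes any document whose contentKey is "row-42" and adds this one;
            // if no such document exists, this behaves like addDocument.
            writer.updateDocument(new Term("contentKey", "row-42"), doc);

            writer.commit();
            writer.close();
        }
    }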

I am not aware of any benchmark to compare my program against.

Please suggest.

EDIT: I am now able to achieve around 180,000 documents per hour. The issue was calling IndexWriter.commit() after updating each Document; I now commit at regular intervals, and that has improved performance greatly.
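
Roughly, the change looks like the helper below. The class, the commit interval of 1,000, and the list parameters are placeholders; in my real code the interval is driven by the Spring Batch chunk size:

    import java.util.List;

    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.Term;

    final class BatchedCommit {
        private static final int COMMIT_INTERVAL = 1_000; // placeholder; tune to your chunk size

        static void index(IndexWriter writer, List<Term> keys, List<Document> docs) throws Exception {
            for (int i = 0; i < docs.size(); i++) {
                writer.updateDocument(keys.get(i), docs.get(i));
                if ((i + 1) % COMMIT_INTERVAL == 0) {
                    writer.commit(); // a commit is an expensive durability point, so do it rarely
                }
            }
            writer.commit(); // flush whatever is left at the end
        }
    }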

You can't expect people to telepathically diagnose your performance problem. Dissect the performance into query, Lucene, and disk output, and try to identify the bottleneck. Also, if you haven't already, get informed about the expected performance difference between addDocument and updateDocument. If you know you're not inserting duplicates, you might want to use addDocument. – Marko Topolnik
Yes, you are correct. I am not saying that I have a performance problem; I just want to know what is considered normal speed. I have edited my question. One flaw I found in my code was committing after each Document. I am using updateDocument because duplicates might try to get in (as of now, there is no way to filter them out in advance) and we don't want duplicates in the indices. – Sabir Khan
Commits make a huge difference. The expected speed is "very high"; by itself it should max out rotational disk throughput (if that's what you consider "the usual one"). – Marko Topolnik

1 Answer


I was making multiple mistakes, which is why write performance was slow. Some of the mistakes and their fixes were:

  1. I was committing after each document, so I changed the program to commit after each chunk, since I am using Spring Batch. Increasing the commit interval improved performance significantly (see the sketch after this list).

  2. I was closing and reopening writer instances unnecessarily (the logic was initially designed that way). I changed it to maintain a single writer instance for the lifetime of the application and reuse it as needed.

  3. The source data came from a DB2 database and reading from the tables was slow, so I added database indexes to improve read performance.

  4. The Lucene writer is thread safe, so I started writing from multiple threads instead of a single thread.
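
Here is a rough sketch of points 1, 2 and 4 combined: one shared writer, a worker pool, and a commit per chunk rather than per document. The field name, pool size and chunk contents are made-up placeholders, not my actual Spring Batch configuration:

    import java.nio.file.Paths;
    import java.util.Arrays;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field.Store;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.store.FSDirectory;

    public class ParallelIndexerSketch {
        public static void main(String[] args) throws Exception {
            // One writer for the whole run; IndexWriter is thread safe and is reused, never reopened.
            IndexWriter writer = new IndexWriter(
                    FSDirectory.open(Paths.get("/path/to/index")),
                    new IndexWriterConfig(new StandardAnalyzer()));

            ExecutorService pool = Executors.newFixedThreadPool(8); // roughly one thread per core

            // Each inner list stands in for one Spring Batch chunk handed over by the ItemReader.
            List<List<String>> chunks = Arrays.asList(
                    Arrays.asList("row-1", "row-2"),
                    Arrays.asList("row-3", "row-4"));

            for (List<String> chunk : chunks) {
                pool.submit(() -> {
                    try {
                        for (String key : chunk) {
                            Document doc = new Document();
                            doc.add(new StringField("contentKey", key, Store.YES));
                            writer.updateDocument(new Term("contentKey", key), doc);
                        }
                        writer.commit(); // one commit per chunk, not per document
                    } catch (Exception e) {
                        throw new RuntimeException(e);
                    }
                });
            }

            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.HOURS);
            writer.close();
        }
    }

Committing per chunk instead of per document is what moved my throughput from around 15,000 to around 180,000 documents per hour.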

So, after increasing the Lucene writer's commit interval, indexing itself doesn't take much time, provided there is enough memory to hold large sets of documents. Document reading and preparation don't take much time either. Lucene can index a few million documents in just a couple of minutes on modern machines.