I have written a program to index database data to disk, and I am not sure whether my indexing speed is reasonable, i.e. whether it is very slow and whether it can be improved further.
I get around 15,000 documents per hour, which amounts to around 2,600 KB of index directory size when creating new indices.
I am using Lucene 6.0.0 on Windows 8.1 64-bit, with 16 GB RAM and an Intel Core i7 8-core machine. I am indexing on my local machine and am not sure what kind of disks I have; it's the usual one that comes with a Windows PC.
I am using Spring Batch to INNER JOIN two database tables and get a row-mapped object from the ItemReader, then I prepare a Document from this object.
I always use writer.updateDocument(contentDuplicateKeyTerm, doc); and not addDocument(doc), since in Lucene 6.0.0 updateDocument adds a document to the index if it doesn't already exist, in addition to updating an existing document.
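A toy in-memory model may help make the difference between the two calls concrete. This is not Lucene code: `ToyIndex` and its string keys are hypothetical stand-ins, and the sketch only assumes that an update behaves as "delete every document matching the key, then add the new one", while a plain add never checks for an existing document.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical toy index; each entry is a {key, body} pair. Not the Lucene API. */
final class ToyIndex {
    private final List<String[]> docs = new ArrayList<>();

    /** Appends unconditionally, so repeated keys produce duplicates. */
    void addDocument(String key, String body) {
        docs.add(new String[] {key, body});
    }

    /** Removes every entry with the same key, then appends the new entry. */
    void updateDocument(String key, String body) {
        docs.removeIf(d -> d[0].equals(key));
        docs.add(new String[] {key, body});
    }

    int size() {
        return docs.size();
    }
}
```

Under this model, calling `updateDocument` twice with the same key leaves one entry, while calling `addDocument` twice leaves two. The extra lookup for existing entries is the price paid for that guarantee.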
I am not aware of any benchmark to compare my program against.
Please suggest.
EDIT: I am now able to achieve around 180,000 documents per hour. The issue was calling IndexWriter.commit() after updating each Document; now I commit at regular intervals, and that has improved performance greatly.
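The change described in the EDIT could look roughly like the sketch below. It is a minimal illustration, not the actual program: `IndexSink`, `BatchedCommitter`, `CountingSink` and the interval value are hypothetical names introduced here, with `IndexSink` standing in for the two IndexWriter calls so the example runs without Lucene on the classpath.

```java
/** Hypothetical stand-in for the two IndexWriter calls used in the question. */
interface IndexSink<D> {
    void update(D doc);  // plays the role of writer.updateDocument(term, doc)
    void commit();       // plays the role of writer.commit()
}

/** Commits once every commitInterval documents instead of after each one. */
final class BatchedCommitter<D> {
    private final IndexSink<D> sink;
    private final int commitInterval;
    private int pending = 0;  // documents written since the last commit

    BatchedCommitter(IndexSink<D> sink, int commitInterval) {
        if (commitInterval <= 0) {
            throw new IllegalArgumentException("commitInterval must be positive");
        }
        this.sink = sink;
        this.commitInterval = commitInterval;
    }

    void write(D doc) {
        sink.update(doc);
        if (++pending >= commitInterval) {
            sink.commit();
            pending = 0;
        }
    }

    /** Call once at the end so a partial final batch is also committed. */
    void finish() {
        if (pending > 0) {
            sink.commit();
            pending = 0;
        }
    }
}

/** Test double that only counts calls. */
final class CountingSink implements IndexSink<String> {
    int updates = 0;
    int commits = 0;

    @Override public void update(String doc) { updates++; }
    @Override public void commit() { commits++; }
}
```

With an interval of 1,000, writing 2,500 documents through `BatchedCommitter` and then calling `finish()` triggers three commits rather than 2,500. The right interval is a trade-off: a larger value means fewer commits, but more uncommitted work is lost if the process dies mid-batch.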
addDocument and updateDocument. If you know you're not inserting duplicates, you might want to use addDocument. – Marko Topolnik
updateDocument because duplicates might try to get in (as of now, there is no way to filter them out in advance) and we don't want duplicates in indices. – Sabir Khan