I have written a program that indexes database data to disk, and I am not sure whether my indexing speed is reasonable, i.e. whether it is very slow and whether it can be improved further.
I get around 15,000 documents per hour, which amounts to around 2,600 KB of index directory size when creating new indices.
I am using Lucene 6.0.0 on Windows 8.1 64-bit, with 16 GB RAM and an 8-core Intel Core i7. I am indexing on my local machine and am not sure what kind of disk I have; it is the usual one that comes with a Windows PC.
I am using Spring Batch to INNER JOIN two database tables and get a row-mapped object from the ItemReader; I then build a Document from this object.
I always call writer.updateDocument(contentDuplicateKeyTerm, doc) rather than addDocument(doc), since in Lucene 6.0.0 updateDocument adds the document to the index if it doesn't already exist, in addition to updating an existing document.
I am not aware of any benchmark to compare my program against.
Any suggestions would be appreciated.
EDIT: I am now able to achieve around 180,000 documents per hour. The issue was calling IndexWriter.commit() after updating each document; I now commit at regular intervals, and that has improved performance greatly.
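A minimal sketch of the commit-at-intervals pattern described in the edit, for illustration only. It deliberately does not depend on Lucene: IndexSink and CountingSink are hypothetical stand-ins for the IndexWriter.updateDocument(...) and IndexWriter.commit() calls, the string keys stand in for the Term, and the interval of 1000 is an arbitrary illustrative value. It only shows how the number of commits drops when they are batched.

```java
import java.util.ArrayList;
import java.util.List;

public class BatchedCommitSketch {

    /** Hypothetical stand-in for the two IndexWriter calls used in the question. */
    interface IndexSink {
        void updateDocument(String key, String doc);
        void commit();
    }

    /** Counts calls so the effect of batching is visible without a real index. */
    static class CountingSink implements IndexSink {
        int updates = 0;
        int commits = 0;

        @Override
        public void updateDocument(String key, String doc) {
            updates++;
        }

        @Override
        public void commit() {
            commits++;
        }
    }

    /**
     * Writes every document, committing once per commitInterval documents
     * and once more at the end if any uncommitted documents remain.
     */
    static void indexAll(List<String> docs, IndexSink sink, int commitInterval) {
        if (commitInterval <= 0) {
            throw new IllegalArgumentException("commitInterval must be positive");
        }
        int pending = 0;
        for (int i = 0; i < docs.size(); i++) {
            // The key plays the role of the duplicate-key Term in the question.
            sink.updateDocument("key-" + i, docs.get(i));
            pending++;
            if (pending == commitInterval) {
                sink.commit();
                pending = 0;
            }
        }
        if (pending > 0) {
            sink.commit(); // flush the final partial batch
        }
    }

    public static void main(String[] args) {
        List<String> docs = new ArrayList<>();
        for (int i = 0; i < 2500; i++) {
            docs.add("doc " + i);
        }

        CountingSink perDocument = new CountingSink();
        indexAll(docs, perDocument, 1);      // commit after every document

        CountingSink batched = new CountingSink();
        indexAll(docs, batched, 1000);       // commit every 1000 documents

        System.out.println("per-document commits: " + perDocument.commits); // 2500
        System.out.println("batched commits: " + batched.commits);          // 3
    }
}
```

With a real IndexWriter, the same loop shape applies: call updateDocument for each item and commit() only when the counter reaches the interval, plus once at the end. The figures above (15,000 versus 180,000 documents per hour) correspond to roughly a 12x speedup from that change alone.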
addDocument and updateDocument. If you know you're not inserting duplicates, you might want to use addDocument. - Marko Topolnik
updateDocument because duplicates might try to get in (as of now, there is no way to filter them out in advance) and we don't want duplicates in indices. - Sabir Khan