
I have the following Lucene code for indexing. When I run this code with 1 million records, it runs fast (indexing completes in about 15 seconds, both locally and on a server with a high-end configuration).

When I try to index 20 million records, it takes about 10 minutes to complete the indexing.

I am indexing these 20 million records on a Linux server with more than 100 GB of RAM. Would setting a larger RAM buffer size help in this case? If so, how much RAM can I allocate to the buffer, given that the machine has more than 100 GB of RAM?

I tried the same 20 million records on my local machine (8 GB RAM), and it took the same ten minutes. Setting a 1 GB RAM buffer size locally still took 10 minutes, and leaving the RAM buffer size unset also took 10 minutes for 20 million records on my local machine.

On Linux, without setting the RAM buffer size, it took about 8 minutes for 20 million records.

final File docDir = new File(docsPath.getFile().getAbsolutePath());
LOG.info("Indexing to directory '" + indexPath + "'...");
Directory dir = FSDirectory.open(new File(indexPath.getFile().getAbsolutePath()));
Analyzer analyzer = null;
IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_47, analyzer);
iwc.setOpenMode(OpenMode.CREATE_OR_APPEND);
iwc.setRAMBufferSizeMB(512.0);
IndexWriter indexWriter = new IndexWriter(dir, iwc);

if (docDir.canRead()) {
    if (docDir.isDirectory()) {
        String[] files = docDir.list();
        if (files != null) {

            for (int i = 0; i < files.length; i++) {
                File file = new File(docDir, files[i]);
                String filePath = file.getPath();
                String delimiter = BatchUtil.getProperty("file.delimiter");
                if (filePath.indexOf("ecid") != -1) {
                    indexEcidFile(indexWriter, file, delimiter);
                } else if (filePath.indexOf("entity") != -1) {
                    indexEntityFile(indexWriter, file, delimiter);
                }
            }
        }
    }
}
indexWriter.forceMerge(2);
indexWriter.close();

And one of the methods used for indexing:

private void indexEntityFile(IndexWriter writer, File file, String delimiter) {

    FileInputStream fis = null;
    try {
        fis = new FileInputStream(file);
        BufferedReader br = new BufferedReader(new InputStreamReader(fis, Charset.forName("UTF-8")));

        Document doc = new Document();
        Field four_pk_Field = new StringField("four_pk", "", Field.Store.NO);
        doc.add(four_pk_Field);
        Field cust_grp_cd_Field = new StoredField("cust_grp_cd", "");
        Field cust_grp_mbrp_id_Field = new StoredField("cust_grp_mbrp_id", "");
        doc.add(cust_grp_cd_Field);
        doc.add(cust_grp_mbrp_id_Field);
        String line = null;

        while ((line = br.readLine()) != null) {

            String[] lineTokens = line.split("\\" + delimiter);
            four_pk_Field.setStringValue(four_pk); // note: the construction of four_pk is omitted from this snippet
            String cust_grp_cd = lineTokens[4];
            cust_grp_cd_Field.setStringValue(cust_grp_cd);
            String cust_grp_mbrp_id = lineTokens[5];
            cust_grp_mbrp_id_Field.setStringValue(cust_grp_mbrp_id);
            writer.addDocument(doc);
        }
        br.close();
    } catch (FileNotFoundException fnfe) {
        LOG.error("", fnfe);
    } catch (IOException ioe) {
        LOG.error("", ioe);
    } finally {
        try {
            if (fis != null) {
                fis.close();
            }
        } catch (IOException e) {
            LOG.error("", e);
        }
    }
}

Any ideas?


1 Answer


This happens because you try to index all 20 million documents in a single commit (and Lucene needs to hold all 20 million docs in memory). To fix it, add

writer.commit()

in the indexEntityFile method every X added documents. X could be 1 million or so.

The code could look like this (it just shows the approach; you need to adapt it to your needs):

int numberOfDocsInBatch = 0;
...
writer.addDocument(doc);
numberOfDocsInBatch++;
if (numberOfDocsInBatch == 1_000_000) {
    writer.commit();
    numberOfDocsInBatch = 0;
}
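
For example, wired into the indexEntityFile method from the question, the batching could look roughly like the sketch below. This is just an illustration, not a drop-in replacement: the COMMIT_INTERVAL value and the way four_pk is assembled from the first four columns are assumptions you would need to adjust to your data.

import java.io.BufferedReader;
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.IndexWriter;

public class EntityFileIndexer {

    // Assumed batch size; tune it to your heap size and durability needs.
    private static final int COMMIT_INTERVAL = 1_000_000;

    void indexEntityFile(IndexWriter writer, File file, String delimiter) throws IOException {
        try (BufferedReader br = Files.newBufferedReader(file.toPath(), StandardCharsets.UTF_8)) {
            // Reuse one Document and its Field instances for every line to avoid per-line allocations.
            Document doc = new Document();
            Field fourPkField = new StringField("four_pk", "", Field.Store.NO);
            Field custGrpCdField = new StoredField("cust_grp_cd", "");
            Field custGrpMbrpIdField = new StoredField("cust_grp_mbrp_id", "");
            doc.add(fourPkField);
            doc.add(custGrpCdField);
            doc.add(custGrpMbrpIdField);

            int docsInBatch = 0;
            String line;
            while ((line = br.readLine()) != null) {
                String[] tokens = line.split("\\" + delimiter);
                // Assumption: the composite key is built from the first four columns.
                fourPkField.setStringValue(tokens[0] + tokens[1] + tokens[2] + tokens[3]);
                custGrpCdField.setStringValue(tokens[4]);
                custGrpMbrpIdField.setStringValue(tokens[5]);
                writer.addDocument(doc);

                // Commit every COMMIT_INTERVAL documents so uncommitted state stays bounded.
                if (++docsInBatch == COMMIT_INTERVAL) {
                    writer.commit();
                    docsInBatch = 0;
                }
            }
        }
        // Documents added since the last commit are picked up by the final writer.commit()
        // or writer.close() at the end of the whole indexing run.
    }
}

Keep the interval large, since every commit does an fsync and committing too often will slow indexing down; the point is only to avoid accumulating 20 million documents' worth of uncommitted state in a single batch.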