Why does Lucene cause OOM when indexing large files?

Question

I’m working with Lucene 2.4.0 and the JVM (JDK 1.6.0_07). I’m consistently receiving OutOfMemoryError: Java heap space, when trying to index large text files.

Example 1: Indexing a 5 MB text file runs out of memory with a 64 MB max. heap size. So I increased the max. heap size to 512 MB. This worked for the 5 MB text file, but Lucene still used 84 MB of heap space to do this. Why so much?

The class FreqProxTermsWriterPerField appears to be the biggest memory consumer by far according to JConsole and the TPTP Memory Profiling plugin for Eclipse Ganymede.

Example 2: Indexing a 62 MB text file runs out of memory with a 512 MB max. heap size. Increasing the max. heap size to 1024 MB works but Lucene uses 826 MB of heap space while performing this. Still seems like way too much memory is being used to do this. I’m sure larger files would cause the error as it seems correlative.

I’m on a Windows XP SP2 platform with 2 GB of RAM. So what is the best practice for indexing large files? Here is a code snippet that I’m using:

// Index the content of a text file.
private Boolean saveTXTFile(File textFile, Document textDocument) throws MyException {           

        try {             

              Boolean isFile = textFile.isFile();
              Boolean hasTextExtension = textFile.getName().endsWith(".txt");

              if (isFile && hasTextExtension) {

                    System.out.println("File " + textFile.getCanonicalPath() + " is being indexed");
                    Reader textFileReader = new FileReader(textFile);
                    if (textDocument == null)
                          textDocument = new Document();
                    textDocument.add(new Field("content", textFileReader));
                    indexWriter.addDocument(textDocument);   // BREAKS HERE!!!!
              }                    
        } catch (FileNotFoundException fnfe) {
              System.out.println(fnfe.getMessage());
              return false;
        } catch (CorruptIndexException cie) {
              throw new MyException("The index has become corrupt.");
        } catch (IOException ioe) {
              System.out.println(ioe.getMessage());
              return false;
        }                    
        return true;
  }

I find it weird that FreqProxTermsWriterPerField should come up as a big consumer. When you use the Field(String, Reader) constructor, like you have done, it doesn't store term vectors. Can you please post the code about how you initialized IndexWriter, how this method is called and the post-processing, if any. — Shashikant Kore
Here is how I initialize the IndexWriter: indexWriter = new IndexWriter(indexDirectory, new StandardAnalyzer(),createFlag, MaxFieldLength.UNLIMITED); indexWriter.setMergeScheduler(new org.apache.lucene.index.SerialMergeScheduler()); indexWriter.setRAMBufferSizeMB(32); indexWriter.setMergeFactor(1000); indexWriter.setMaxFieldLength(Integer.MAX_VALUE); indexWriter.setUseCompoundFile(false); indexWriter.close(); — Paul Murdoch
Sorry about the formatting. Do you know how I can re-post and get the code snippets to look like my original post? — Paul Murdoch
org.apache.lucene.index.FreqProxTermsWriter$PostingList That is the class consuming the most memory by far. When OOM occurs I performed a heap dump and used jhat to analyze it. The number of instances of the class above is the number of unique text indexed before the OOM. — Paul Murdoch
Merge factor is very high. Default value is 10. At 1000, you may run out of file descriptors also. Try removing this option. — Shashikant Kore

Narayan Narayan · Accepted Answer · 2009-09-04T06:59:25

In response as a comment to Gandalf

I can see you are setting the setMergeFactor to 1000

the API says

setMergeFactor

public void setMergeFactor(int mergeFactor)

Determines how often segment indices are merged by addDocument(). With smaller values, less RAM is used while indexing, and searches on unoptimized indices are faster, but indexing speed is slower. With larger values, more RAM is used during indexing, and while searches on unoptimized indices are slower, indexing is faster. Thus larger values (> 10) are best for batch index creation, and smaller values (< 10) for indices that are interactively maintained.

This method is a convenience method, it uses the RAM as you increase the mergeFactor

What i would suggest is set it to something like 15 or so on.; (on a trial and error basis) complemented with setRAMBufferSizeMB, also call Commit(). then optimise() and then close() the indexwriter object.(probably make a JavaBean and put all these methods in one method) call this method when you are closing the index.

post with your result, feedback =]

Why does Lucene cause OOM when indexing large files?

5 Answers