I’m working with Lucene 2.4.0 and the JVM (JDK 1.6.0_07). I’m consistently getting an OutOfMemoryError: Java heap space when trying to index large text files.
Example 1: Indexing a 5 MB text file runs out of memory with a 64 MB max. heap size. So I increased the max. heap size to 512 MB. This worked for the 5 MB text file, but Lucene still used 84 MB of heap space to do this. Why so much?
The class FreqProxTermsWriterPerField appears to be the biggest memory consumer by far, according to both JConsole and the TPTP Memory Profiling plugin for Eclipse Ganymede.
Example 2: Indexing a 62 MB text file runs out of memory with a 512 MB max. heap size. Increasing the max. heap size to 1024 MB works, but Lucene uses 826 MB of heap space while doing it. That still seems like far too much memory. I’m sure larger files would cause the error, since memory usage appears to scale with file size.
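(The heap figures above come from JConsole/TPTP; a rough in-code check like the sketch below, using plain java.lang.Runtime, can also be used to watch heap growth between addDocument calls. printHeapUsage is just an illustrative helper, not part of my code.)

// Rough heap-usage check (sketch only; JConsole/TPTP give the per-class breakdown).
private static void printHeapUsage(String label) {
    Runtime rt = Runtime.getRuntime();
    long usedBytes = rt.totalMemory() - rt.freeMemory();
    System.out.println(label + ": " + (usedBytes / (1024 * 1024)) + " MB of heap in use");
}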
I’m on a Windows XP SP2 platform with 2 GB of RAM. So what is the best practice for indexing large files? Here is a code snippet that I’m using:
// Index the content of a text file.
private Boolean saveTXTFile(File textFile, Document textDocument) throws MyException {
    try {
        Boolean isFile = textFile.isFile();
        Boolean hasTextExtension = textFile.getName().endsWith(".txt");
        if (isFile && hasTextExtension) {
            System.out.println("File " + textFile.getCanonicalPath() + " is being indexed");
            Reader textFileReader = new FileReader(textFile);
            if (textDocument == null)
                textDocument = new Document();
            textDocument.add(new Field("content", textFileReader));
            indexWriter.addDocument(textDocument);   // BREAKS HERE!!!!
        }
    } catch (FileNotFoundException fnfe) {
        System.out.println(fnfe.getMessage());
        return false;
    } catch (CorruptIndexException cie) {
        throw new MyException("The index has become corrupt.");
    } catch (IOException ioe) {
        System.out.println(ioe.getMessage());
        return false;
    }
    return true;
}
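For completeness, indexWriter is a field created elsewhere in the class. A minimal sketch of a typical Lucene 2.4.0 setup is shown below; the index path, analyzer choice, and RAM buffer value are placeholders, not my exact configuration.

import java.io.IOException;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

// Sketch of the IndexWriter setup (placeholder path and default settings).
private IndexWriter createIndexWriter() throws IOException {
    Directory indexDir = FSDirectory.getDirectory("C:\\lucene-index");   // placeholder path
    IndexWriter writer = new IndexWriter(indexDir,
            new StandardAnalyzer(),
            true,                                   // create a new index
            IndexWriter.MaxFieldLength.UNLIMITED);
    writer.setRAMBufferSizeMB(16.0);                // Lucene's default flush threshold (16 MB)
    return writer;
}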