4
votes

I need to read (15000) excel files for my thesis. I'm using apache poi to open and later to analyze them but after around 5000 files I'm getting the following exception and stacktrace:

Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded
at org.apache.xmlbeans.impl.store.Cur$CurLoadContext.attr(Cur.java:3044)
at org.apache.xmlbeans.impl.store.Cur$CurLoadContext.attr(Cur.java:3065)
at org.apache.xmlbeans.impl.store.Locale$SaxHandler.startElement(Locale.java:3263)
at org.apache.xmlbeans.impl.piccolo.xml.Piccolo.reportStartTag(Piccolo.java:1082)
at org.apache.xmlbeans.impl.piccolo.xml.PiccoloLexer.parseAttributesNS(PiccoloLexer.java:1822)
at org.apache.xmlbeans.impl.piccolo.xml.PiccoloLexer.parseOpenTagNS(PiccoloLexer.java:1521)
at org.apache.xmlbeans.impl.piccolo.xml.PiccoloLexer.parseTagNS(PiccoloLexer.java:1362)
at org.apache.xmlbeans.impl.piccolo.xml.PiccoloLexer.yylex(PiccoloLexer.java:4682)
at org.apache.xmlbeans.impl.piccolo.xml.Piccolo.yylex(Piccolo.java:1290)
at org.apache.xmlbeans.impl.piccolo.xml.Piccolo.yyparse(Piccolo.java:1400)
at org.apache.xmlbeans.impl.piccolo.xml.Piccolo.parse(Piccolo.java:714)
at org.apache.xmlbeans.impl.store.Locale$SaxLoader.load(Locale.java:3479)
at org.apache.xmlbeans.impl.store.Locale.parseToXmlObject(Locale.java:1277)
at org.apache.xmlbeans.impl.store.Locale.parseToXmlObject(Locale.java:1264)
at org.apache.xmlbeans.impl.schema.SchemaTypeLoaderBase.parse(SchemaTypeLoaderBase.java:345)
at org.apache.poi.POIXMLTypeLoader.parse(POIXMLTypeLoader.java:92)
at org.openxmlformats.schemas.spreadsheetml.x2006.main.WorksheetDocument$Factory.parse(Unknown Source)
at org.apache.poi.xssf.usermodel.XSSFSheet.read(XSSFSheet.java:173)
at org.apache.poi.xssf.usermodel.XSSFSheet.onDocumentRead(XSSFSheet.java:165)
at org.apache.poi.xssf.usermodel.XSSFWorkbook.parseSheet(XSSFWorkbook.java:417)
at org.apache.poi.xssf.usermodel.XSSFWorkbook.onDocumentRead(XSSFWorkbook.java:382)
at org.apache.poi.POIXMLDocument.load(POIXMLDocument.java:178)
at org.apache.poi.xssf.usermodel.XSSFWorkbook.<init>(XSSFWorkbook.java:249)
at org.apache.poi.xssf.usermodel.XSSFWorkbook.<init>(XSSFWorkbook.java:302)
at de.spreadsheet_realtions.analysis.WorkbookAnalysis.analyze(WorkbookAnalysis.java:18)

Code (at the moment just open the file and close the file):

public static void main(String[] args) {
    start();
}

public void start(){
    File[] files = getAllFiles(Config.folder);
    ZipSecureFile.setMinInflateRatio(0.00);
    for(File f: files){
        analyze(f);
    }
}

public void analyze(File file){
    Workbook  workbook = null;
    try {
        workbook = new XSSFWorkbook(file); //line 18
    } catch (Exception e1) {e1.printStackTrace(); return;}
//      later would be here the code to analyze the workbook
    try {
        workbook.close();
    } catch (Exception e) {e.printStackTrace();}
}

I tried also with OPCPackage.open(file) and I got the same result.

What I'm doing wrong or what can I do to solve this problem? Thanks for any help.


EDIT: The same for the code below.

try (XSSFWorkbook workbook = new XSSFWorkbook(file)){
} catch (Exception e1) {e1.printStackTrace(); return;}
2
It could be one very big file that causes an OOM based on the memory settings that you define for you java process. Can you try to run only with the one file where the OOM happens and see if this one alone already triggers the OOM?centic
Yes, it is one big file (42mb) and whiteout this file it is working :-) thank you.MichaD

2 Answers

5
votes

Typically, POI has the whole workbook in memory. So, a large workbook requires a different approach.

While writing, one can use SXSSF and most calls are the same, except that only a certain number of rows are in memory.

In your case, you are reading. For this you can use their "event driven" API. The basic idea here is that you do not get the workbook as one huge object. Instead, you get it piecemeal, as it is read, and you can save off as much as you wish into your own data-structure. Or, you can simply process it as you read it and not save very much.

Since this is a lower-level API (driven by the structure of the data being read), there is one approach for XLS and a different approach for XLSX. Look at the POI "How To" page, and find the section titled "XSSF and SAX (Event API)".

That example demonstrates how to detect the value of each cell as it is read in. (You'll need the xercesImpl.jar on your library path.)

2
votes

In the case of an exception in your first try block, you return, so you wouldn't close the workbook.

Put the close in a finally block.

Workbook workbook = null;
try {
  workbook = new XSSFWorkbook(file); //line 18

  // later would be here the code to analyze the workbook
} catch (Exception e1) {
  e1.printStackTrace(); return;
}  finally {
  if (workbook != null) workbook.close();
}

Or, better, use try-with-resources.

try (XSSFWorkbook workbook = new XSSFWorkbook(file) {
  // later would be here the code to analyze
} catch (Exception e1) {
  e1.printStackTrace();
}
// No need for explicit close.