I have a relatively simple ANTLR4 grammar for CSV files that may contain a header line followed by data lines whose values are separated by spaces. The values on each data line are: Double Double Int String Date Time, where Date is in yyyy-mm-dd format and Time is in hh:mm:ss.xxx format. This resulted in the following grammar:
grammar CSVData;
start : (headerline | dataline) (NL dataline)* ;
headerline : STRING (' ' STRING)* ;
dataline : FLOAT ' ' FLOAT ' ' INT ' ' STRING ' ' DAY ' ' TIME ; //lat lon floor hid day time
NL : '\r'? '\n' ;
DAY : INT '-' INT '-' INT ; //yyyy-mm-dd
TIME : INT ':' INT ':' INT '.' INT ; //hh:mm:ss.xxx
INT : DIGIT+ ;
FLOAT : '-'? DIGIT* '.' DIGIT+ ;
STRING : LETTER (LETTER | DIGIT | SPECIALCHAR)* | (DIGIT | SPECIALCHAR)+ LETTER (LETTER | DIGIT | SPECIALCHAR)* ;
fragment LETTER : [A-Za-z] ;
fragment DIGIT : [0-9] ;
fragment SPECIALCHAR: [_:] ;
In my Java application I use a listener that extends CSVDataBaseListener and only overrides the enterDataline(CSVDataParser.DatalineContext ctx) method. There I simply fetch the tokens and create one object per line.
Loading a 10 MB file works as intended, but when I try to load a 110 MB file my application dies with an OutOfMemoryError: GC overhead limit exceeded. I'm running my application with 1 GB of RAM, so in my opinion the file size shouldn't be a problem.
I also tried writing a parser in plain Java that simply uses String.split(" "). This parser works as intended, also for the 110 MB input file.
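The hand-written parser is essentially the following sketch (the names DataLine, SimpleCsvParser, and parseLine are illustrative, not my exact code; the date and time columns happen to match Java's ISO defaults, so java.time can parse them directly):

```java
import java.time.LocalDate;
import java.time.LocalTime;

// Hypothetical holder for one data line: lat lon floor hid day time.
final class DataLine {
    final double lat;
    final double lon;
    final int floor;
    final String hid;
    final LocalDate day;
    final LocalTime time;

    DataLine(double lat, double lon, int floor, String hid, LocalDate day, LocalTime time) {
        this.lat = lat;
        this.lon = lon;
        this.floor = floor;
        this.hid = hid;
        this.day = day;
        this.time = time;
    }
}

final class SimpleCsvParser {
    // Splits one space-separated line into its six fields and builds a DataLine.
    // Unlike the ANTLR pipeline, this allocates no tokens and no parse tree.
    static DataLine parseLine(String line) {
        String[] f = line.split(" ");
        return new DataLine(
                Double.parseDouble(f[0]),
                Double.parseDouble(f[1]),
                Integer.parseInt(f[2]),
                f[3],
                LocalDate.parse(f[4]),   // yyyy-mm-dd is the ISO_LOCAL_DATE default
                LocalTime.parse(f[5]));  // hh:mm:ss.xxx parses as ISO_LOCAL_TIME
    }
}
```

Per line this creates only the field objects I actually keep, plus the short-lived String[] from split.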
To get an estimate of the size of the objects I create, I serialized them as suggested in this answer. For the 110 MB input file the resulting size was 86,513,392 bytes, which is far from exhausting 1 GB of RAM.
So I'd like to know why ANTLR needs so much RAM for such a simple grammar. Is there any way to improve my grammar so that ANTLR uses less memory?
EDIT
I did some deeper memory analysis by loading a file with 1 million lines (approx. 77 MB on disk). For every single line my grammar produces 12 tokens (the six values plus five spaces and one newline). This can be cut down to six tokens per line if the grammar ignores whitespace, but that's still far worse than a hand-written parser.
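The whitespace-ignoring variant could look roughly like this (a sketch, not a tested drop-in replacement: the WS rule discards spaces in the lexer with -> skip, so the parser rules no longer mention ' ' and CommonTokenStream never buffers the separator tokens):

```antlr
grammar CSVDataNoWs;
start : (headerline | dataline) (NL dataline)* ;
headerline : STRING+ ;
dataline : FLOAT FLOAT INT STRING DAY TIME ; //lat lon floor hid day time
NL : '\r'? '\n' ;
WS : ' '+ -> skip ; //separators never become buffered tokens
DAY : INT '-' INT '-' INT ; //yyyy-mm-dd
TIME : INT ':' INT ':' INT '.' INT ; //hh:mm:ss.xxx
INT : DIGIT+ ;
FLOAT : '-'? DIGIT* '.' DIGIT+ ;
STRING : LETTER (LETTER | DIGIT | SPECIALCHAR)* | (DIGIT | SPECIALCHAR)+ LETTER (LETTER | DIGIT | SPECIALCHAR)* ;
fragment LETTER : [A-Za-z] ;
fragment DIGIT : [0-9] ;
fragment SPECIALCHAR: [_:] ;
```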
For 1 million input lines the memory dumps had the following size:
- My grammar above: 1,926 MB
- The grammar finding six tokens per line: 1,591 MB
- My self-written parser: 415 MB
So fewer tokens does mean less memory used, but for simple grammars like this I'd still recommend writing your own parser: it's not that hard, and you avoid a lot of the memory overhead ANTLR adds.