I have a relatively simple ANTLR4 grammar for CSV files that may contain a header line followed by data lines whose values are separated by spaces. The values on each data line are: Double Double Int String Date Time, where Date is in yyyy-mm-dd format and Time is in hh:mm:ss.xxx format. This resulted in the following grammar:
grammar CSVData;
start : (headerline | dataline) (NL dataline)* ;
headerline : STRING (' ' STRING)* ;
dataline : FLOAT ' ' FLOAT ' ' INT ' ' STRING ' ' DAY ' ' TIME ; //lat lon floor hid day time
NL : '\r'? '\n' ;
DAY : INT '-' INT '-' INT ; //yyyy-mm-dd
TIME : INT ':' INT ':' INT '.' INT ; //hh:mm:ss.xxx
INT : DIGIT+ ;
FLOAT : '-'? DIGIT* '.' DIGIT+ ;
STRING : LETTER (LETTER | DIGIT | SPECIALCHAR)* | (DIGIT | SPECIALCHAR)+ LETTER (LETTER | DIGIT | SPECIALCHAR)* ;
fragment LETTER : [A-Za-z] ;
fragment DIGIT : [0-9] ;
fragment SPECIALCHAR: [_:] ;
In my Java application I use a listener that extends CSVDataBaseListener and only overrides the enterDataline(CSVDataParser.DatalineContext ctx) method. There I simply fetch the tokens and create one object per line.
Loading a 10 MB file works as intended, but when I try to load a 110 MB file my application dies with an OutOfMemoryError: GC overhead limit exceeded. I'm running my application with 1 GB of RAM, so in my opinion the file size shouldn't be a problem.
I also tried writing a parser in plain Java that simply uses String.split(" "). This parser works as intended, also for the 110 MB input file.
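The hand-written parser is essentially the following sketch (the names DataLine, SimpleCsvParser, and parseLine are illustrative, not my exact code; the date and time columns happen to match Java's ISO defaults, so java.time can parse them directly):

```java
import java.time.LocalDate;
import java.time.LocalTime;

// Hypothetical holder for one data line: lat lon floor hid day time.
final class DataLine {
    final double lat;
    final double lon;
    final int floor;
    final String hid;
    final LocalDate day;
    final LocalTime time;

    DataLine(double lat, double lon, int floor, String hid, LocalDate day, LocalTime time) {
        this.lat = lat;
        this.lon = lon;
        this.floor = floor;
        this.hid = hid;
        this.day = day;
        this.time = time;
    }
}

final class SimpleCsvParser {
    // Splits one space-separated line into its six fields and builds a DataLine.
    // Unlike the ANTLR pipeline, this allocates no tokens and no parse tree.
    static DataLine parseLine(String line) {
        String[] f = line.split(" ");
        return new DataLine(
                Double.parseDouble(f[0]),
                Double.parseDouble(f[1]),
                Integer.parseInt(f[2]),
                f[3],
                LocalDate.parse(f[4]),   // yyyy-mm-dd is the ISO_LOCAL_DATE default
                LocalTime.parse(f[5]));  // hh:mm:ss.xxx parses as ISO_LOCAL_TIME
    }
}
```

Per line this creates only the field objects I actually keep, plus the short-lived String[] from split.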
To get an estimate of the size of the objects I create, I serialized them as suggested in this answer. For the 110 MB input file the resulting size was 86,513,392 bytes, which is far from exhausting 1 GB of RAM.
So I'd like to know why ANTLR needs so much RAM for such a simple grammar. Is there any way to improve my grammar so that ANTLR uses less memory?
EDIT
I did some deeper memory analysis by loading a file with 1 million lines (approx. 77 MB on disk). For every single line my grammar produces 12 tokens (the six values plus five spaces and one newline). This can be cut down to six tokens per line if the grammar ignores whitespace, but that's still far worse than a hand-written parser.
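The whitespace-ignoring variant could look roughly like this (a sketch, not a tested drop-in replacement: the WS rule discards spaces in the lexer with -> skip, so the parser rules no longer mention ' ' and CommonTokenStream never buffers the separator tokens):

```antlr
grammar CSVDataNoWs;
start : (headerline | dataline) (NL dataline)* ;
headerline : STRING+ ;
dataline : FLOAT FLOAT INT STRING DAY TIME ; //lat lon floor hid day time
NL : '\r'? '\n' ;
WS : ' '+ -> skip ; //separators never become buffered tokens
DAY : INT '-' INT '-' INT ; //yyyy-mm-dd
TIME : INT ':' INT ':' INT '.' INT ; //hh:mm:ss.xxx
INT : DIGIT+ ;
FLOAT : '-'? DIGIT* '.' DIGIT+ ;
STRING : LETTER (LETTER | DIGIT | SPECIALCHAR)* | (DIGIT | SPECIALCHAR)+ LETTER (LETTER | DIGIT | SPECIALCHAR)* ;
fragment LETTER : [A-Za-z] ;
fragment DIGIT : [0-9] ;
fragment SPECIALCHAR: [_:] ;
```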
For 1 million input lines the memory dumps had the following size:
- My grammar above: 1,926 MB
- The grammar finding six tokens per line: 1,591 MB
- My self-written parser: 415 MB
So fewer tokens does mean less memory used, but for simple grammars like this I'd still recommend writing your own parser: it's not that hard, and you avoid a lot of the memory overhead ANTLR adds.