I'm fairly new to ANTLR and I'm trying to write a parser for DXF files with ANTLR v4. DXF files use so-called group codes to specify the type of the data that follows.
Example excerpt from some DXF file:
0
SECTION
2
HEADER
9
$ORTHOMODE
70
0
9
0
ENDSEC
For example, the first 0 means that a string follows on the next line. The group code 70 means that a 16-bit integer follows; in the example it's 0.
My problem now is how to distinguish between the group code 0 and the integer value 0.
In the example snippet it seems that Integer values have some special indentation, but I couldn't find anything about this in the DXF reference.
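To make the pair structure concrete, here is a minimal Python sketch (not ANTLR) of how I understand the format: lines alternate between a group code and its value, so the position of a line decides whether a 0 is a group code or data. The function names, the shortened sample with its indentation, and the set of integer codes are my own assumptions for illustration; only codes 0 and 70 are taken from the excerpt above.

```python
from typing import Iterator, List, Tuple, Union

# Assumption: only group code 70 is treated as an integer here, because it
# is the only integer code that appears in the excerpt.
INT_CODES = {70}


def read_pairs(lines: List[str]) -> Iterator[Tuple[int, str]]:
    """Yield (group_code, raw_value) from alternating code/value lines."""
    # Strip indentation so "  0" and "0" are treated the same.
    stripped = [line.strip() for line in lines]
    if len(stripped) % 2 != 0:
        raise ValueError("expected an even number of lines (code/value pairs)")
    for i in range(0, len(stripped), 2):
        yield int(stripped[i]), stripped[i + 1]


def convert(code: int, raw: str) -> Union[int, str]:
    """Interpret the raw value according to its group code."""
    return int(raw) if code in INT_CODES else raw


# Hypothetical, shortened sample built from the pairs in the excerpt.
SAMPLE = "  0\nSECTION\n  2\nHEADER\n  9\n$ORTHOMODE\n 70\n     0\n  0\nENDSEC\n"

pairs = [(code, convert(code, raw)) for code, raw in read_pairs(SAMPLE.splitlines())]
for code, value in pairs:
    print(code, repr(value))
```

Read this way, the 0 after 70 is never confused with a group code, because it sits in a value position. That position-based distinction is what I would like to express in the grammar.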
My idea so far was the following ANTLR grammar:
grammar SimpleDXF;
start : HEADER variable* ENDSEC ;
variable : varstart (groupcode NL value NL)+ ;
varstart : VAR ;
groupcode : INT ;
value : INT | ANYCHARSEQ ;
WS : [ \t]+ -> skip ;
NL : '\r'? '\n' ;
HEADER : '0' NL 'SECTION' NL '2' NL 'HEADER' NL ;
ENDSEC : '0' NL 'ENDSEC' NL ;
VAR : '9' NL VARNAME NL ;
VARNAME : '$' LETTER (LETTER | DIGIT)* NL ;
INT : DIGIT+ NL ;
ANYCHARSEQ : ANYCHAR+ NL ;
fragment ANYCHAR : [\u0021-\u00FF] ;
fragment LETTER : [A-Za-z_] ;
fragment DIGIT : [0-9] ;
But obviously this fails when trying to parse the integer 0, since the lexer treats it as the group code 0 because of the header rule.
So now I'm at a loss as to how to resolve this. Any help is highly appreciated.
EDIT
I changed the ANTLR grammar to include more lexer rules. Now the problem is that the lexer fails completely: the first input character becomes an INT token instead of part of the HEADER token, as I intended it to be. The reason is that removing whitespace with -> skip does not work inside a single token (see the following example):
For the input A B (with a space between the two letters), this grammar will work:
start : 'A' 'B' ;
WS : [ \t\r\n]+ -> skip ;
But this grammar will not work:
start : AB ;
AB : 'A' 'B' ;
WS : [ \t\r\n]+ -> skip ;