ANTLR Distinguish DXF group codes and integers

Question

I'm fairly new to ANTLR and I'm trying to write a parser for DXF files with ANTLR v4. DXF files use so-called group codes to specify the type of the data that follows.

Example excerpt from some DXF file:

  0
SECTION
  2
HEADER
  9
$ORTHOMODE
 70
     0
  0
ENDSEC

For example, the first 0 means that a string follows on the next line. The group code 70 means that a 16-bit integer follows; in the example it is 0. My problem is how to distinguish between the group code 0 and the integer 0. In the example snippet, integer values seem to have some special indentation, but I couldn't find anything about this in the DXF reference.

My idea so far was the following ANTLR grammar:

grammar SimpleDXF;

start       :   HEADER variable* ENDSEC ;
variable    :   varstart (groupcode NL value NL)+ ;
varstart    :   VAR ;
groupcode   :   INT ;
value       :   INT | ANYCHARSEQ ;

WS          :   [ \t]+ -> skip ;  
NL          :   '\r'? '\n' ;
HEADER      :   '0' NL 'SECTION' NL '2' NL 'HEADER' NL ;
ENDSEC      :   '0' NL 'ENDSEC' NL ;
VAR         :   '9' NL VARNAME NL ;
VARNAME     :   '$' LETTER (LETTER | DIGIT)* NL ;
INT         :   DIGIT+ NL ;
ANYCHARSEQ  :   ANYCHAR+ NL ;

fragment ANYCHAR    :   [\u0021-\u00FF] ;
fragment LETTER     :   [A-Za-z_] ;
fragment DIGIT      :   [0-9] ;

But obviously this fails when trying to parse the integer 0, since the lexer treats it as the group code 0 because of the HEADER rule.

So now I'm clueless how to resolve my problem. Any help is highly appreciated.

EDIT

I changed the ANTLR grammar to include more lexer rules. Now the problem is that the lexer fails completely: the first input character becomes an INT token instead of part of the HEADER token, as I intended. The reason is that removing whitespace with -> skip does not work inside a single token (see the following example):

For the input A B (with a space between the two letters), this grammar will work:

start   :   'A' 'B' ;
WS      :   [ \t\r\n]+ -> skip ;

But this grammar will not work:

start   :   AB ;
AB      :   'A' 'B' ;
WS      :   [ \t\r\n]+ -> skip ;

schauk11erd · Accepted Answer · 2014-05-26T15:58:29

I've solved the problem with a preprocessing step that puts every group code and its corresponding value on the same line. The preprocessing also removes leading and trailing whitespace, as @UweAllner suggested. The example input file from the question looks like this after preprocessing:

0 SECTION
2 HEADER
9 $ORTHOMODE
70 0
0 ENDSEC
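
The preprocessing code itself is not part of this answer. A minimal sketch in Python, assuming the file strictly alternates group code lines and value lines (the function name join_group_pairs is made up for illustration), could look like this:

```python
def join_group_pairs(text):
    """Put each group code and the value that follows it on one line."""
    # Remove leading and trailing whitespace from every line.
    lines = [line.strip() for line in text.splitlines()]
    # Assumption: group code lines and value lines strictly alternate.
    if len(lines) % 2 != 0:
        raise ValueError("expected alternating group code / value lines")
    # Even indices hold group codes, odd indices hold the values.
    pairs = zip(lines[0::2], lines[1::2])
    return "".join(f"{code} {value}\n" for code, value in pairs)


raw = "  0\nSECTION\n  2\nHEADER\n  9\n$ORTHOMODE\n 70\n     0\n  0\nENDSEC\n"
print(join_group_pairs(raw), end="")
```

Blank value lines are kept on purpose, since dropping them would shift every later pair by one line.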

This makes it easy to distinguish group codes from plain integers, because group codes are always at the start of a line, while integers are at the end of a line. The following example grammar solves the problem:

grammar SimpleDXF;

start           :   HEADER variable* ENDSEC ;
variable        :   varstart groupcodevalue+ ;
varstart        :   VAR ;
groupcodevalue  :   GROUPCODE value ;
value           :   (INT | ANYCHARSEQ) NL ;

NL              :   '\r'? '\n' ;
HEADER          :   '0 SECTION' NL '2 HEADER' NL ;
ENDSEC          :   '0 ENDSEC' NL ;
VAR             :   '9 ' VARNAME NL ;
GROUPCODE       :   INT ' ' ;
VARNAME         :   '$' LETTER (LETTER | DIGIT)* ;
INT             :   '-'? DIGIT+ ;
ANYCHARSEQ      :   ANYCHAR+ ;

fragment ANYCHAR:   [\u0021-\u00FF] ;
fragment LETTER :   [A-Za-z_] ;
fragment DIGIT  :   [0-9] ;
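
The positional rule the grammar relies on (group code first, value last) can also be checked outside ANTLR. The following Python sketch is only an illustration under the assumption that the input has already been preprocessed as described; split_pairs is a hypothetical helper, not part of any DXF or ANTLR library:

```python
def split_pairs(preprocessed):
    """Split preprocessed lines into (group code, value) tuples."""
    result = []
    for line in preprocessed.splitlines():
        # Everything before the first space is the group code,
        # everything after it is the value (kept as a string).
        code, _, value = line.partition(" ")
        result.append((int(code), value))
    return result


sample = "0 SECTION\n2 HEADER\n9 $ORTHOMODE\n70 0\n0 ENDSEC\n"
for code, value in split_pairs(sample):
    print(code, repr(value))
```

In the pair for the line 70 0, the group code is 70 and the value is the string "0", so the two zeros in the file are no longer ambiguous.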
