0
votes

I'm fairly new to ANTLR and I'm trying to write a parser for DXF files with ANTLR v4. DXF files use so-called group codes to specify the type of the data that follows.

Example excerpt from some DXF file:

  0
SECTION
  2
HEADER
  9
$ORTHOMODE
 70
     0
  9
  0
ENDSEC

For example, the first 0 means that a string follows on the next line. The group code 70 means that a 16-bit integer follows; in the example it is 0. My problem is how to distinguish between the group code 0 and the integer 0. In the example snippet, integer values seem to have special indentation, but I couldn't find anything about this in the DXF reference.

My idea so far was the following ANTLR grammar:

grammar SimpleDXF;

start       :   HEADER variable* ENDSEC ;
variable    :   varstart (groupcode NL value NL)+ ;
varstart    :   VAR ;
groupcode   :   INT ;
value       :   INT | ANYCHARSEQ ;

WS          :   [ \t]+ -> skip ;  
NL          :   '\r'? '\n' ;
HEADER      :   '0' NL 'SECTION' NL '2' NL 'HEADER' NL ;
ENDSEC      :   '0' NL 'ENDSEC' NL ;
VAR         :   '9' NL VARNAME NL ;
VARNAME     :   '$' LETTER (LETTER | DIGIT)* NL ;
INT         :   DIGIT+ NL ;
ANYCHARSEQ  :   ANYCHAR+ NL ;

fragment ANYCHAR    :   [\u0021-\u00FF] ;
fragment LETTER     :   [A-Za-z_] ;
fragment DIGIT      :   [0-9] ;

But obviously this fails when trying to parse the integer 0, since the lexer treats it as the group code 0 because of the HEADER rule.

So now I'm at a loss as to how to resolve my problem. Any help is highly appreciated.

EDIT

I changed the ANTLR grammar to include more lexer rules. Now the lexer fails completely: the first input character becomes an INT token instead of part of the HEADER token, as I intended. The reason is that removing whitespace with -> skip does not work inside a single token (see the following example):

For the input A B (with a space between the two letters), this grammar will work:

start   :   'A' 'B' ;
WS      :   [ \t\r\n]+ -> skip ;  

But this grammar will not work:

start   :   AB ;
AB      :   'A' 'B' ;
WS      :   [ \t\r\n]+ -> skip ;  

2 Answers

1
votes

I've solved the problem with a preprocessing step that puts every group code and its corresponding value on the same line. The preprocessing also removes leading and trailing whitespace, as @UweAllner suggested. The example input file from the question looks like this after preprocessing:

0 SECTION
2 HEADER
9 $ORTHOMODE
70 0
0 ENDSEC

This makes it easy to distinguish group codes from plain integers, because group codes are always at the start of a line, while integer values are at the end of a line. The following example grammar solves the problem:

grammar SimpleDXF;

start           :   HEADER variable* ENDSEC ;
variable        :   varstart groupcodevalue+ ;
varstart        :   VAR ;
groupcodevalue  :   GROUPCODE value ;
value           :   (INT | ANYCHARSEQ) NL ;

NL              :   '\r'? '\n' ;
HEADER          :   '0 SECTION' NL '2 HEADER' NL ;
ENDSEC          :   '0 ENDSEC' NL ;
VAR             :   '9 ' VARNAME NL ;
GROUPCODE       :   INT ' ' ;
VARNAME         :   '$' LETTER (LETTER | DIGIT)* ;
INT             :   '-'? DIGIT+ ;
ANYCHARSEQ      :   ANYCHAR+ ;

fragment ANYCHAR:   [\u0021-\u00FF] ;
fragment LETTER :   [A-Za-z_] ;
fragment DIGIT  :   [0-9] ;
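
For reference, the preprocessing step described above could look roughly like the following. This is only a minimal Python sketch, not the code I actually used: the function name pair_group_codes is made up for illustration, and it assumes the input strictly alternates between a group code line and a value line.

```python
def pair_group_codes(lines):
    """Join each group code line with the value line that follows it.

    Assumption: the input strictly alternates group code / value, so an
    odd number of lines is treated as an error. Empty value lines are kept.
    """
    # Remove leading and trailing whitespace from every line.
    stripped = [line.strip() for line in lines]
    if len(stripped) % 2 != 0:
        raise ValueError("expected an even number of lines (code/value pairs)")
    # Even indices hold group codes, odd indices hold the matching values.
    return [f"{code} {value}" for code, value in zip(stripped[0::2], stripped[1::2])]


raw = "  0\nSECTION\n  2\nHEADER\n  9\n$ORTHOMODE\n 70\n     0\n  0\nENDSEC\n"
print("\n".join(pair_group_codes(raw.splitlines())))
```

The output of this step is what gets fed to the lexer, so the grammar never sees a group code and its value on separate lines.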
0
votes

You are missing a rule like

group: groupcode NL value;

Otherwise (as you say) no distinction is possible between groupcodes and values as such. Or, if one groupcode may be followed by several values:

group: groupcode (NL value)+;

And you should define header and endsec as HEADER and ENDSEC to allow the lexer to distinguish between "just a number" and "is the start of a sequence". The same possibly for the start of the variable rule (and everything consisting of a fixed sentence).

EDIT: Something like

HEADER      :   '0' WS* NL WS* 'SECTION' WS* NL WS* '2' WS* NL WS* 'HEADER' WS* NL ;

comes to mind, although it is not very elegant. But strange file formats require exotic measures.

To straighten this out a little, would it be possible for you to trim the lines of leading and trailing whitespace before they are lexed and parsed?
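
Such a trimming pass could be as small as the following Python sketch. The helper name trim_lines is hypothetical; it keeps one item per line and only removes the surrounding whitespace, so the WS* parts in a rule like HEADER would no longer be needed.

```python
def trim_lines(text):
    """Strip leading and trailing whitespace from every line of the input."""
    # splitlines() handles both \n and \r\n; lines are rejoined with \n.
    return "\n".join(line.strip() for line in text.splitlines()) + "\n"


print(trim_lines("  0\nSECTION\n 70\n     0\n"), end="")
```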