
I am writing a parser for Delphi's DFM files. The lexer looks like this:

EXP ([Ee][-+]?[0-9]+)

%%

("#"([0-9]{1,5}|"$"[0-9a-fA-F]{1,6})|"'"([^']|'')*"'")+ { 
                                                 return tkStringLiteral; }
"object" { return tkObjectBegin; }
"end" { return tkObjectEnd; }
"true" { /*yyval.boolean = true;*/ return tkBoolean; }
"false" { /*yyval.boolean = false;*/ return tkBoolean; }

"+" | "." | "(" | ")" | "[" | "]" | "{" | "}" | "<" | ">" | "=" | "," | 
":" { return yytext[0]; }

[+-]?[0-9]{1,10} { /*yyval.integer = atoi(yytext);*/ return tkInteger; }
[0-9A-F]+ { return tkHexValue; }
[+-]?[0-9]+"."[0-9]+{EXP}? { /*yyval.real = atof(yytext);*/ return tkReal; }
[a-zA-Z_][0-9a-zA-Z_]* { return tkIdentifier; }
"$"[0-9A-F]+ { /* yyval.integer = atoi(yytext);*/ return tkHexNumber; }

[ \t\r\n] { /* ignore whitespace */ }
. { std::cerr << boost::format("Mystery character %c\n") % *yytext; }

<<EOF>> { yyterminate(); }

%%

and the Bison grammar looks like:

%token tkInteger
%token tkReal
%token tkIdentifier
%token tkHexValue
%token tkHexNumber
%token tkObjectBegin
%token tkObjectEnd
%token tkBoolean
%token tkStringLiteral

%%

object:
    tkObjectBegin tkIdentifier ':' tkIdentifier 
          property_assignment_list tkObjectEnd
  ;

property_assignment_list:
    property_assignment
  | property_assignment_list property_assignment
  ;

property_assignment:
    property '=' value
  | object
  ;

property:
    tkIdentifier
  | property '.' tkIdentifier
  ;

value:
    atomic_value
  | set
  | binary_data
  | strings
  | collection
  ;

atomic_value:
    tkInteger
  | tkReal
  | tkIdentifier
  | tkBoolean
  | tkHexNumber
  | long_string
  ;

long_string:
    tkStringLiteral
  | long_string '+' tkStringLiteral
  ;

atomic_value_list:
    atomic_value
  | atomic_value_list ',' atomic_value
  ;

set:
    '[' ']'
  | '[' atomic_value_list ']'
  ;

binary_data:
    '{' '}'
  | '{' hexa_lines '}'
  ;

hexa_lines:
    tkHexValue
  | hexa_lines tkHexValue
  ;

strings:
    '(' ')'
  | '(' string_list ')'
  ;

string_list:
    tkStringLiteral
  | string_list tkStringLiteral
  ;

collection:
    '<' '>'
  | '<' collection_item_list '>'
  ;

collection_item_list:
    collection_item
  | collection_item_list collection_item
  ;

collection_item:
    tkIdentifier property_assignment_list tkObjectEnd
  ;

%%

void yyerror(const char *s, ...) {...}

The problem with this grammar occurs while parsing the binary data. Binary data in DFM files is nothing but a sequence of hexadecimal characters which never spans more than 80 characters per line. An example of it is:

Picture.Data = {
      055449636F6E0000010001002020000001000800A80800001600000028000000
      2000000040000000010008000000000000000000000000000000000000000000

      ...

      FF00000000000000000000000000000000000000000000000000000000000000
      00000000FF000000FF000000FF00000000000000000000000000000000000000
      00000000}

As you can see, this element lacks any markers, so the hex lines clash with other token types. In the example above, the first line returns the proper tkHexValue token. The second, however, returns a tkInteger token and the third a tkIdentifier token. So when parsing reaches the binary data, it fails with a syntax error, because binary data is supposed to be composed only of tkHexValue tokens.
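
One quick way to confirm which token each line produces is to run the scanner standalone and print what it returns. A minimal driver along these lines (the header name dfm.tab.h and the build details are assumptions, not part of my actual setup) would be:

#include <cstdio>
#include "dfm.tab.h"      /* token values generated by "bison -d"; name assumed */

extern int yylex();       /* generated by flex */
extern char* yytext;
extern FILE* yyin;

int main(int argc, char** argv) {
    if (argc > 1)
        yyin = std::fopen(argv[1], "r");   /* default input is stdin */
    /* yylex() returns 0 at end of input (yyterminate). */
    for (int tok = yylex(); tok != 0; tok = yylex())
        std::printf("%d\t%s\n", tok, yytext);
    return 0;
}

/* Build sketch (assumes %option noyywrap or linking with -lfl):
   flex dfm.l && bison -d dfm.y && g++ lex.yy.c driver.cpp */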

My first workaround was to require integers to have a maximum length (which helped in all but the last line of the binary data). The second was to move the tkHexValue rule above tkIdentifier, but that means I would no longer get identifiers like F0.

I was wondering if there is any way to fix this grammar?


1 Answer


OK, I solved this one. I needed to define a state so that tkHexValue is only returned while reading binary data. In the preamble (definitions section) of the lexer I added:

%x BINARY

and modified the following rules:

"{" {BEGIN BINARY; return yytext[0];}
<BINARY>"}" {BEGIN INITIAL; return yytext[0];}
<BINARY>[ \t\r\n] { /* ignore whitespace */ }

And that was all!
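
One detail the snippet above leaves implicit: because BINARY is declared with %x it is an exclusive state, so rules without a start-condition prefix do not apply inside it. Presumably the hex-value rule itself also has to be moved into the BINARY state (replacing the plain [0-9A-F]+ rule), roughly:

<BINARY>[0-9A-F]+ { return tkHexValue; }

That way runs of hex digits are only recognized between the braces of a binary data block and no longer compete with tkInteger and tkIdentifier elsewhere.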