
I am trying to parse a file like the one below. (It is simpler than what I actually need, but it is fine as a starting point.)

@Book{key2,
 Author="Some2VALUE" ,
 Title="VALUE2" 
}

The lexer is:

[A-Za-z"][^\\\"  \n\(\),=\{\}#~_]*      { yylval.sval = strdup(yytext); return KEY; }
@[A-Za-z][A-Za-z]+                 {yylval.sval = strdup(yytext + 1); return ENTRYTYPE;}
[ \t\n]                                ; /* ignore whitespace */
[{}=,]                                 { return *yytext; }
.                                      { fprintf(stderr, "Unrecognized character %c in input\n", *yytext); }

And then parsing this with:

%union
{
    char    *sval;
};

%token <sval> ENTRYTYPE
%type <sval> VALUE
%token <sval> KEY

%start Input

%%

Input: Entry
      | Input Entry ;  /* input is one or more entries */

Entry: 
     ENTRYTYPE '{' KEY ','{ 
         b_entry.type = $1; 
         b_entry.id = $3;
         b_entry.table = g_hash_table_new_full(g_str_hash, g_str_equal, free, free);} 
     KeyVals '}' {
         parse_entry(&b_entry);
         g_hash_table_destroy(b_entry.table);
         free(b_entry.type); free(b_entry.id);
         b_entry.table = NULL;
         b_entry.type = b_entry.id = NULL;}
     ;

KeyVals: 
      /* empty */ 
      | KeyVals KeyVal ; /* zero or more keyvals */

VALUE:
      /*empty*/
      | KEY 
      | VALUE KEY 
      ;
KeyVal: 
      /*empty*/
      KEY '=' VALUE ',' { g_hash_table_replace(b_entry.table, $1, $3); }
      | KEY '=' VALUE  { g_hash_table_replace(b_entry.table, $1, $3); }
      | error '\n' {yyerrok;}
      ;

There are a few problems, so I need to generalize both the lexer and the parser:

1) It cannot read a sentence: if the RHS is Author="Some Value", only "Some is returned, i.e. the space is not handled. I don't know how to fix this.
2) If I enclose the RHS in {} rather than "", I get a syntax error.

I am looking for help with these two situations.

1 Answer


The main issue is that your tokens are not appropriate. The lexer should split your example into tokens as follows:

@Book        ENTRYTYPE
{            '{'
key2         KEY
,            ','
Author       KEY
=            '='
"Some2VALUE" VALUE
,            ','
Title        KEY
=            '='
"VALUE2"     VALUE
}            '}'

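With this scheme VALUE is a token returned by the lexer, so the parser would declare it with %token <sval> VALUE instead of defining a VALUE rule. The VALUE token could, for example, be defined as follows: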

%x value
%%
"\""           {BEGIN(value);}
<value>"\""    {BEGIN(INITIAL); return VALUE;}
<value>"\\\""  { /* escaped " */ }
<value>[^"]    { /* Non-escaped char */ }

Or in a single expression as

"\""([^"]|("\\\""))*"\""

This assumes that only " needs to be escaped, and that it is escaped with a \. I'm not sure how BibTeX defines escaping of a ", or whether it allows it at all.
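
To make the idea concrete, here is a rough sketch in plain C of the same scanning logic, extended to the {...} form from the question. It is only an illustration: scan_value is a hypothetical helper, not part of flex or bison, and the rules it applies are assumptions. It assumes a backslash protects the next character, that braces nest, and that a " inside nested braces does not end a quoted value.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper (not part of flex or bison): scan one delimited value
 * starting at src[0], which must be '"' or '{'.
 * Returns a malloc'd copy of the text between the delimiters and stores the
 * number of input characters consumed in *consumed.
 * Returns NULL if the value is unterminated or allocation fails.
 */
char *scan_value(const char *src, size_t *consumed)
{
    assert(src != NULL && consumed != NULL);
    assert(src[0] == '"' || src[0] == '{');

    int quoted = (src[0] == '"');
    int depth = 0;               /* nesting depth of inner braces */
    size_t i = 1;                /* skip the opening delimiter */

    for (; src[i] != '\0'; i++) {
        char c = src[i];
        if (c == '\\' && src[i + 1] != '\0') {
            i++;                 /* assumption: backslash protects the next char */
        } else if (c == '{') {
            depth++;
        } else if (c == '}') {
            if (depth == 0 && !quoted)
                break;           /* closing brace of a {...} value */
            if (depth > 0)
                depth--;
        } else if (c == '"' && quoted && depth == 0) {
            break;               /* closing quote of a "..." value */
        }
    }
    if (src[i] == '\0')
        return NULL;             /* ran off the end: unterminated value */

    size_t len = i - 1;          /* text between the two delimiters */
    char *out = malloc(len + 1);
    if (out == NULL)
        return NULL;
    memcpy(out, src + 1, len);
    out[len] = '\0';
    *consumed = i + 1;           /* include the closing delimiter */
    return out;
}
```

Spaces need no special treatment here because everything between the delimiters belongs to a single VALUE, which is the point of tokenizing this way. The returned string keeps escapes and inner braces as written; whether they should be stripped depends on what you do with the value afterwards.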