0
votes

I've got a simple file format that I want to parse with the jison parser generator. This file can consist of multiple expressions in arbitrary order and quantity. Here is the jison file for the parser:

/* lexical grammar */
%lex

%%

\s+                   /* skip whitespace */
\"(\\.|[^"])*\"          return 'STRING'

File\s*Version\s*\:      return 'FILEVERSION'
[0-9]+("."[0-9]+)?\b     return 'NUMBER'
<<EOF>>                  return 'EOF'
.                        return 'INVALID'

/lex

%start expressions

%% /* language grammar */

expressions
    : EOF
    | e expressions EOF
    ;

e
    : STRING
    | FILEID 
    ;

FILEID
    : FILEVERSION NUMBER { return $1 + $2; }
    ;

For simplicity, I've shortened the file to have only strings and file id expressions.

My problem is, that the generated parser seems to recognize only one complete expression or two, if the second expression consists of only one token like strings. For example:

File Version: 1.0

Will be parsed, or

File Version: 1.0 "My String"

Will also be parsed, but for

File Version: 1.0 "My String" "Not parsed string"

the last string won't be parsed.

I've tried this code with jison debugger and on the jison page itself but both pages are showing the same results.

My suggestions about the problem are:

  1. Some lexer errors (regex)
  2. Some grammar error (left-right-recursion)
  3. Some action missing in parser (kind of { $$ = $1;} )
  4. Some other bison/jison magic I'm missing

I'm not that ebnf-parser-guru so please keep your answers as simple as possible.

1

1 Answers

3
votes

The immediate problem is that you return from the FILEID production. return returns, so the parse terminates with the returned value. Normally, semantic rules should provide their result by assigning to the variable $$. (That's not necessary for "unit rules" where there is only one symbol on the right-hand side; before the action is executed, the parser does $$ = $1, so if that's what you want, you can just leave the action out as you do in your two FILEID rules.)

In addition, your expressions production doesn't do anything with $2, so even if you fixed the first problem, you would still only see one e in the result.

Your expressions production is also incorrect, since it requires one EOF token for every e, in addition to EOF from the base case. Consider how the productions work:

expressions -> e expressions EOF
            -> e e expressions EOF EOF
            -> e e e expressions EOF EOF EOF
            -> e e e EOF EOF EOF EOF

Personally, I'd suggest using left-recursion instead of right-recursion. Bottom-up parsers like jison prefer left-recursion, and it usually leads to more natural semantic rules, as in this case.

Finally, you need to return the final value when you actually reach the end of the input. In jison, that usually requires an explicit start rule whose semantic action is return.

So, with all that in mind, let's try this: (I changed the names of some of the non-terminals, and down-cased FILEID because it is conventional to use lower-case for non-terminals and upper-case for terminals)

%start prog
%%
prog   : exprs EOF          { return $1; }
       ;
exprs  :                    { $$ = []; }
       | exprs expr         { $$.push($2); }
       ;
expr   : file_id
       | STRING
       ;
file_id: FILEVERSION NUMBER { $$ = $1 + $2; }
       ;

One note about your regular expression for matching strings:

\"(\\.|[^"])*\"          return 'STRING'

Although it apparently works in javascript (mostly; see below), it will exhibit a bug in flex (or a Posix-compatible regular expression library). It mostly works in javascript because the javascript regular expression alternation operator | is ordered; if the first alternative matches, the second alternative is never tried unless the rest of the pattern fails to match (and in that case, the bug will be triggered).

But in (f)lex, the alternation operator notices all matching alternatives, and eventually chooses the longest possible match. The result is that when matching "\\"...", flex will match the token until the third quote, by using [^"] to match the first \ and then \\. to match the \". That lets it continue looking for a closing quote.

It's easy to write the regex so that it will work with either semantic, and I'd strongly suggest doing that in case you ever want to migrate to a different parser generator, by simply ensuring that \ is not matched by [^"]:

\"(\\.|[^\\"])*\"        return 'STRING'

This change will also fix the subtle bug, even in javascript, that "\" is considered a valid string token if it is the last string in the input. In this case, javascript will first use \\. to match \", but once it does that it will not find any closing quote. It will then backtrack and try matching with [^"], which will match the \, allowing the quote to then be recognized as the closing quote.