Parsing optional semicolon at statement end

Question

I was writing a parser to parse C-like grammars.

First, it could now parse code like:

a = 1;
b = 2;

Now I want to make the semicolon at the end of line optional.

The original YACC rule was:

stmt: expr ';' { ... }

Where the new line is processed by the lexer that written by myself(the code are simplified):

rule(/\r\n|\r|\n/)          { increase_lineno(); return :PASS }

the instruction :PASS here is equivalent to return nothing in LEX, which drop current matched text and skip to the next rule, just like what is usually done with whitespaces.

Because of this, I can't just simply change my YACC rule into:

stmt: expr end_of_stmt { ... }
;
end_of_stmt: ';'
    | '\n'
;

So I chose to change the lexer's state dynamically by the parser correspondingly.

Like this:

stmt: expr { state = :STATEMENT_END } ';' { ... }

And add a lexer rule that can match new line with the new state:

rule(/\r\n|\r|\n/, :STATEMENT_END) { increase_lineno(); state = nil; return ';' }

Which means when the lexer is under :STATEMENT_END state. it will first increase the line number as usual, and then set the state into initial one, and then pretend itself is a semicolon.

It's strange that it doesn't actually work with following code:

a = 1
b = 2

I debugged it and got it is not actually get a ';' as expect when scanned the newline after the number 1, and the state specified rule is not really executed.

And the code to set the new state is executed after it already scanned the new line and returned nothing, that means, these works is done as following order:

scan a, = and 1
scan newline and skip, so get the next value b
the inserted code({ state = :STATEMENT_END }) is executed
raising error -- unexpected b here

This is what I expect:

scan a, = and 1
found that it matches the rule expr, so reduce into stmt
execute the inserted code to set the new lexer state
scan the newline and return a ; according the new state matching rule
continue to scan & parse the following line

After introspection I found that might caused as YACC uses LALR(1), this parser will read forward for one token first. When it scans to there, the state is not set yet, so it cannot get a correct token.

My question is: how to make it work as expected? I have no idea on this.

Thanks.

Chris Dodd Chris Dodd · Accepted Answer · 2012-06-10T19:16:56

The first thing to recognize is that having optional line terminators like this introduces ambiguity into your language, and so you first need to decide which way you want to resolve the ambiguity. In this case, the main ambiguity comes from operators that may be either infix or prefix. For example:

a = b
-c;

Do you want to treat the above as a single expr-statement, or as two separate statements with the first semicolon elided? A similar potential ambiguity occurs with function call syntax in a C-like language:

a = b
(c);

If you want these to resolve as two statements, you can use the approach you've tried; you just need to set the state one token earlier. This gets tricky as you DON'T want to set the state if you have unclosed parenthesis, so you end up needing an additional state var to record the paren nesting depth, and only set the insert-semi-before-newline state when that is 0.

If you want to resolve the above cases as one statement, things get tricky, as you actually need more lookahead to decide when a newline should end a statement -- at the very least you need to look at the token AFTER the newline (and any comments or other ignored stuff). In this case you can have the lexer do the extra lookahead. If you were using flex (which you're apparently not?), I would suggest either using the / operator (which does lookahead directly), or defer returning the semicolon until the lexer rule that matches the next token.

In general, when doing this kind of token state recording, I find it easiest to do it entirely within the lexer where possible, so you don't need to worry about the extra token of lookahead sometimes (but not always) done by the parser. In this specific case, an easy approach would be to have the lexer record the parenthesis seen (+1 for (, -1 for )), and the last token returned. Then, in the newline rule, if the paren level is 0 and the last token was something that could end an expression (ID or constant or ) or postfix-only operator), return the extra ;

An alternate approach is to have the lexer return NEWLINE as its own token. You would then change the parser to accept stmt: expr NEWLINE as well as optional newlines between most other tokens in the grammar. This exposes the ambiguity directly to the parser (its now not LALR(1)), so you need to resolve it either by using yacc's operator precedence rules (tricky and error prone), or using something like bison's %glr-parser option or btyacc's backtracking ability to deal with the ambiguity directly.

Parsing optional semicolon at statement end

2 Answers