2
votes

I'm creating a grammar in Bison for a simple dynamically-typed language. I have a "general" expression rule, which is somewhat akin to the concept of an rvalue in C; expressions appear on the right-hand side of an assignment, they can also be sent to functions as arguments etc. A greatly simplified version of the rule follows:

constantExpression
    : TOK_INTEGER_CONSTANT
    | TOK_FLOAT_CONSTANT
    | stringLiteral
    ;

expression
    : constantExpression
    | identifier
    | booleanExpression
    | booleanExpression TOK_QUESTION_MARK expression TOK_COLON expression
    | TOK_LPAREN expression TOK_RPAREN
    | expression TOK_PLUS expression
    | expression TOK_MINUS expression
    ;

I also have a dedicated boolean expression rule; boolean expressions are used most prominently in if statements, but any other context which requires a binary truth value is of course also fine:

booleanExpression
    : identifier
    | expression '<' expression
    | expression '<=' expression
    | expression '==' expression
    | expression '!=' expression
    | expression '>' expression
    | expression '>=' expression
    | booleanExpression '&&' booleanExpression
    | booleanExpression '||' booleanExpression
    | '!' booleanExpression
    | 'true'
    | 'false'
    ;

The problem: obviously the above rules suffer from a wealth of reduce-reduce conflicts. An identifier, depending on context, should be reduced either to an expression (for example, in a statement such as this: myVar2 = myVar1), or to a booleanExpression (obvious example: if (myBoolVar)).

Not only that, but there are also shift-reduce errors related to the fact that a booleanExpresssion reduces to an expression; when the parser has parsed a booleanExpression, and it encounters a && token, it could either shift and keep going, or reduce to an expression. A booleanExpression needs to reduce to an expression to allow code such as

conditionVar = (var1 == 5) || (var2 > 10 && var3 < 20);
if (conditionVar) { ... }

I'm aware of shift-reduce conflicts related to operator precedence, that isn't a problem here, I've fixed it using %left rules for the operators.

My question: what is the best solution to this problem? My ideas are to either

  1. structure the rules in such a way that despite the conflicts, Bison does what I want, and ignore the warnings - quite ugly and unmaintanable
  2. remove the separate booleanExpression rule and move it all to expression - okay, but requires more checks at the semantic analysis stage; in my language, strings are not implicitly convertible to booleans, so code like if ("foo") is not legal, but would be accepted by the parser
  3. use a GLR parser - feels a bit overkill for such a simple use-case, but perhaps I'm wrong

Which of the above is the best idea? Are there any other possibilities I haven't considered?

1
If it's dynamically typed, shouldn't something like if("notABool") be a runtime error anyway? What about someVar = "notABool"; if(someVar), which is already allowed by your current grammar? How are you handling that?sepp2k
@sepp2k Yes, the second example you gave is indeed a runtime error. if ("string") could already be rejected at the parsing stage, or we could say that a string literal creates an object which cannot be used in a boolean context and throw an error at runtime. I don't really know which approach is better, that's partly why I'm asking this question.user4520
Usually dynamically typed languages don't produce any type errors statically, even in cases where they can be detected easily. They might produce warnings though (however, most of the time that's handled by external tools, not the language implementation itself). Either way if you do perform static type checking (be it complete or incomplete, errors or warnings), that's the job of the semantic analysis phase, not the parser.sepp2k

1 Answers

6
votes

Conventional wisdom is: Don't try to do semantic analysis in your grammar.

First, it complicates the grammar, even if it is possible, as you have seen. By contrast, the type-checking rules are very simple when performed in a tree walk over the AST.

Second, it is not really possible. Since your language is dynamic, you don't really know what the type of any variable is. So a compile-time check could result in three cases, not two: good, bad, and unknown. That will be even more complicated in the grammar, but only slightly more complicated in the semantic analysis.

However, depending on the precise nature of your language, it might be possible to choose a middle ground. Generally, some operators -- boolean operators and comparisons --- definitely return boolean values, while some contexts definitely require boolean values. So you could add a boolean_expression non-terminal, used to indicate where results will definitely be boolean and where values must be boolean. Then you could insert into your grammar a single unit production

 boolean_expression: expression

with a semantic action which inserts a runtime check node into the AST.

During semantic analysis, this check can be eliminated if it is determined that it would always succeed or an error produced if it is determined that it would always fail. Otherwise, code will eventually be emitted to do the check.

The advantage of this solution is that the grammar then shows the contexts in which a boolean is required, without suffering from the Byzantine modifications necessary to fully enforce the requirement.

(In the examples below, I only show one boolean operator, one comparison operator, and one arithmetic operator. Obviously a real language would have more of each, but it does not change the presentation at all. I also didn't bother with the prologue, which must include precedence declarations for the operators.)

program : stmt_list
stmt_list:%empty
        | stmt_list stmt
stmt    : assign
        | call
        | empty
        | while
        | '{' stmt_list '}'
assign  : IDENTIFIER '=' expr ';'
call    : expr '(' expr_list ')' ';'
        | expr '(' ')' ';'
empty   : ';'
while   : "while" '(' boolean_expr ')' stmt
expr_list
        : expr
        | expr_list ',' expr
boolean_expr
        : boolean_term
        | boolean_expr "or" boolean_expr
        | expr '<' expr
boolean_term
        : "true" | "false"
        | expr               { /* insert conversion from expr to boolean */ }

expr    : term
        | expr '+' expr
term    : INTEGER
        | IDENTIFIER
        | '(' expr ')'

But it does place some restrictions on the language. In the simplest incarnation as presented above, a boolean value can never be used other than in a boolean context, which prevents boolean values from being first-class values. They can't be used as the right-hand side of an assignment or as an argument in a function call, for example, as is clear from the above grammar.

Also, the above grammar doesn't allow redundant parentheses around boolean expressions.

That's not very satisfactory, but we can do better by separating boolean results from boolean values at the cost of a slight complication in the grammar.

In most languages, boolean values can be created according to defined rules from other values; by convention, a value which is converted to a boolean true is called a "truthy" value. This can be very convenient, although it can also be slightly dangerous if there is too much latitude in the nature of the coercion. [Note 1] To control the damage, we might only allow automatic coercion to boolean in an explicitly boolean context, and never allow a boolean to be automatically coerced to a non-boolean. If you are willing to accept those restrictions, then we can still use the grammar as a tool for documenting boolean contexts and coercions.

In the following, we introduce four non-terminals, all representing some flavour of expression:

  • expr: a non-boolean expression
  • boolean_expr: a specifically boolean expression; the corresponding productions list the syntaxes which must have a boolean result.
  • truthy_expr: either a boolean expression or a non-boolean expression which could be coerced to a boolean expression. This non-terminal is used in places where a boolean value is required. [Note 2]
  • either_expr: either a boolean expression or a non-boolean expression in a context in which either might appear without coercion (assignments and function arguments, for example).

Note that the last two non-terminals have exactly the same productions, but very different semantics (and thus different semantic actions). Because the contexts in which they might appear are disjoint, no conflict results.

Other than the definition of the above non-terminals and their use in various contexts, the grammar is not much changed:

program : stmt_list
stmt_list:%empty
        | stmt_list stmt
stmt    : assign
        | call
        | empty
        | while
        | '{' stmt_list '}'
assign  : IDENTIFIER '=' either_expr ';'
call    : expr '(' expr_list ')' ';'
        | expr '(' ')' ';'
empty   : ';'
while   : "while" '(' truthy_expr ')' stmt
expr_list
        : either_expr
        | expr_list ',' either_expr

truthy_expr
        : boolean_expr
        | expr               { /* insert conversion from expr to boolean */ }

either_expr
        : boolean_expr
        | expr

boolean_expr
        : boolean_term
        | truthy_expr "or" truthy_expr
        | expr '<' expr
boolean_term
        : "true"
        | "false"
        | '(' boolean_expr ')'

expr    : term
        | expr '+' expr
term    : INTEGER
        | IDENTIFIER
        | '(' expr ')'

If you believe the above is too complicated, then go with the conventional wisdom and avoid semantics interspersed in your grammar. If, on the other hand, you feel that it has expository value and your language is such that the restrictions are acceptable, then adapt it to your purposes.


Notes:

  1. The scheme does not depend on the existence of any "truthy" coercion, but if boolean values are first class, there will be expressions which can only be determined to be boolean at runtime (boolean variables, functions returning boolean values, etc.). Consider the run-time check that a value used in a boolean context is a boolean value to be a form of coercion into truthiness, where only true is true and only false is false, while all other values throw an error.

    Personally, I've grown fond of limited automatic boolean coercions. It makes perfect sense to me that a file object be falsy if it is in an error condition, for example, or that a container be truthy if it is non-empty. Restricting these conversions to explicitly boolean contexts, such as the condition in a conditional statement, makes the automatic coercion acceptable to me. But I don't insist on the idea; if you don't like it, ignore the thought.

  2. This isn't a very good name, but truthy_or_falsy_expr seemed too long and boolish_expr seemed too weird. Suggestions are welcome.