Conventional wisdom is: Don't try to do semantic analysis in your grammar.
First, it complicates the grammar, even if it is possible, as you have seen. By contrast, the type-checking rules are very simple when performed in a tree walk over the AST.
Second, it is not really possible. Since your language is dynamic, you don't really know what the type of any variable is. So a compile-time check could result in three cases, not two: good, bad, and unknown. That will be even more complicated in the grammar, but only slightly more complicated in the semantic analysis.
However, depending on the precise nature of your language, it might be possible to choose a middle ground. Generally, some operators -- boolean operators and comparisons --- definitely return boolean values, while some contexts definitely require boolean values. So you could add a boolean_expression
non-terminal, used to indicate where results will definitely be boolean and where values must be boolean. Then you could insert into your grammar a single unit production
boolean_expression: expression
with a semantic action which inserts a runtime check node into the AST.
During semantic analysis, this check can be eliminated if it is determined that it would always succeed or an error produced if it is determined that it would always fail. Otherwise, code will eventually be emitted to do the check.
The advantage of this solution is that the grammar then shows the contexts in which a boolean is required, without suffering from the Byzantine modifications necessary to fully enforce the requirement.
(In the examples below, I only show one boolean operator, one comparison operator, and one arithmetic operator. Obviously a real language would have more of each, but it does not change the presentation at all. I also didn't bother with the prologue, which must include precedence declarations for the operators.)
program : stmt_list
stmt_list:%empty
| stmt_list stmt
stmt : assign
| call
| empty
| while
| '{' stmt_list '}'
assign : IDENTIFIER '=' expr ';'
call : expr '(' expr_list ')' ';'
| expr '(' ')' ';'
empty : ';'
while : "while" '(' boolean_expr ')' stmt
expr_list
: expr
| expr_list ',' expr
boolean_expr
: boolean_term
| boolean_expr "or" boolean_expr
| expr '<' expr
boolean_term
: "true" | "false"
| expr { /* insert conversion from expr to boolean */ }
expr : term
| expr '+' expr
term : INTEGER
| IDENTIFIER
| '(' expr ')'
But it does place some restrictions on the language. In the simplest incarnation as presented above, a boolean value can never be used other than in a boolean context, which prevents boolean values from being first-class values. They can't be used as the right-hand side of an assignment or as an argument in a function call, for example, as is clear from the above grammar.
Also, the above grammar doesn't allow redundant parentheses around boolean expressions.
That's not very satisfactory, but we can do better by separating boolean results from boolean values at the cost of a slight complication in the grammar.
In most languages, boolean values can be created according to defined rules from other values; by convention, a value which is converted to a boolean true
is called a "truthy" value. This can be very convenient, although it can also be slightly dangerous if there is too much latitude in the nature of the coercion. [Note 1] To control the damage, we might only allow automatic coercion to boolean in an explicitly boolean context, and never allow a boolean to be automatically coerced to a non-boolean. If you are willing to accept those restrictions, then we can still use the grammar as a tool for documenting boolean contexts and coercions.
In the following, we introduce four non-terminals, all representing some flavour of expression:
expr
: a non-boolean expression
boolean_expr
: a specifically boolean expression; the corresponding productions list the syntaxes which must have a boolean result.
truthy_expr
: either a boolean expression or a non-boolean expression which could be coerced to a boolean expression. This non-terminal is used in places where a boolean value is required. [Note 2]
either_expr
: either a boolean expression or a non-boolean expression in a context in which either might appear without coercion (assignments and function arguments, for example).
Note that the last two non-terminals have exactly the same productions, but very different semantics (and thus different semantic actions). Because the contexts in which they might appear are disjoint, no conflict results.
Other than the definition of the above non-terminals and their use in various contexts, the grammar is not much changed:
program : stmt_list
stmt_list:%empty
| stmt_list stmt
stmt : assign
| call
| empty
| while
| '{' stmt_list '}'
assign : IDENTIFIER '=' either_expr ';'
call : expr '(' expr_list ')' ';'
| expr '(' ')' ';'
empty : ';'
while : "while" '(' truthy_expr ')' stmt
expr_list
: either_expr
| expr_list ',' either_expr
truthy_expr
: boolean_expr
| expr { /* insert conversion from expr to boolean */ }
either_expr
: boolean_expr
| expr
boolean_expr
: boolean_term
| truthy_expr "or" truthy_expr
| expr '<' expr
boolean_term
: "true"
| "false"
| '(' boolean_expr ')'
expr : term
| expr '+' expr
term : INTEGER
| IDENTIFIER
| '(' expr ')'
If you believe the above is too complicated, then go with the conventional wisdom and avoid semantics interspersed in your grammar. If, on the other hand, you feel that it has expository value and your language is such that the restrictions are acceptable, then adapt it to your purposes.
Notes:
The scheme does not depend on the existence of any "truthy" coercion, but if boolean values are first class, there will be expressions which can only be determined to be boolean at runtime (boolean variables, functions returning boolean values, etc.). Consider the run-time check that a value used in a boolean context is a boolean value to be a form of coercion into truthiness, where only true
is true and only false
is false, while all other values throw an error.
Personally, I've grown fond of limited automatic boolean coercions. It makes perfect sense to me that a file object be falsy if it is in an error condition, for example, or that a container be truthy if it is non-empty. Restricting these conversions to explicitly boolean contexts, such as the condition in a conditional statement, makes the automatic coercion acceptable to me. But I don't insist on the idea; if you don't like it, ignore the thought.
This isn't a very good name, but truthy_or_falsy_expr
seemed too long and boolish_expr
seemed too weird. Suggestions are welcome.
if("notABool")
be a runtime error anyway? What aboutsomeVar = "notABool"; if(someVar)
, which is already allowed by your current grammar? How are you handling that? – sepp2kif ("string")
could already be rejected at the parsing stage, or we could say that a string literal creates an object which cannot be used in a boolean context and throw an error at runtime. I don't really know which approach is better, that's partly why I'm asking this question. – user4520