I'm working on a toy language with a Flex/Bison-generated parser and bindings to Rust. For simplicity, I'd like Bison to call a Rust-defined createCstNode() function for each matched rule to build a concrete syntax tree (which will then be translated into an AST in Rust for further processing). The call needs to include the matched rule type as an argument so that my Rust code knows what type of node it is (expression, if-statement, while-statement, function call, literal string, number, etc.).

From looking through the generated Bison parser, it looks like there is a variable yyn which seems to be an integer representing the matched rule, though I haven't seen it documented anywhere. I'm aware that the %defines option will give me an enum of tokens in parser.tab.h, but I need both terminal and non-terminal symbols enumerated. I also have seen the %token-table option, which gives non-terminal symbols as well, but isn't quite what I need either, and also goes straight into the parser.tab.c file rather than the parser.tab.h file, which makes using something like rust-bindgen more difficult.

So is there any way to have Bison generate an enum similar to the yytokentype enum, which includes non-terminal symbols, and is placed into a header file? Or am I stuck with defining my own enum for the CST node types which match up with the symbols I have? Is yyn documented anywhere? Is it safe to use as a way of identifying the rule which was matched in an action? Is there some better way I can be going about this?

1 Answer

yyn is not documented anywhere (not even in the generated code's comments) and I personally would not recommend using it. By the time bison executes an action, yyn is the action number or, if you like, the number of the production whose reduction triggers the action. Since non-terminals can (and usually do) have multiple productions, yyn does not identify the non-terminal.

You can see the difference if you use the -v option, which writes a report of the grammar and the parser's states to a .output file. At the beginning of that file, you'll find both the production list and the non-terminal list. Here's a simple example:

Grammar

    0 $accept: prog $end

    1 prog: expr
    2     | prog ';' expr

    3 expr: NUMBER
    4     | '(' expr ')'
    5     | expr '+' expr
    6     | expr '-' expr
    7     | expr '*' expr
    8     | expr '/' expr

(This is followed by the list of terminals, which I omitted.)

Nonterminals, with rules where they appear

$accept (11)
    on left: 0
prog (12)
    on left: 1 2, on right: 0 2
expr (13)
    on left: 3 4 5 6 7 8, on right: 1 2 4 5 6 7 8

Here, you can see that there are two user-defined non-terminals, prog with symbol number 12 and expr with symbol number 13. There are eight user-defined productions, numbered from 1 to 8. It is only coincidence that these number ranges don't overlap; symbol numbers less than 11 have been used for the (renumbered) terminal symbols as well as some internal symbols.
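To make the distinction between production numbers and non-terminals concrete, here is a minimal Python sketch that reads the Grammar section of the listing and maps each production number to the non-terminal on its left-hand side. The function name production_owners is hypothetical, and the regular expressions assume the layout shown above; the exact layout of the report can differ between bison versions.

```python
import re

# Grammar section copied from the listing above.
LISTING = """\
Grammar

    0 $accept: prog $end

    1 prog: expr
    2     | prog ';' expr

    3 expr: NUMBER
    4     | '(' expr ')'
    5     | expr '+' expr
    6     | expr '-' expr
    7     | expr '*' expr
    8     | expr '/' expr
"""


def production_owners(listing):
    """Map each production number to the non-terminal on its left-hand side."""
    owners = {}
    current = None
    for line in listing.splitlines():
        # "3 expr: NUMBER" starts the productions of a new non-terminal.
        head = re.match(r"\s*(\d+)\s+(\S+):(?:\s|$)", line)
        # "4     | '(' expr ')'" continues the previous non-terminal.
        cont = re.match(r"\s*(\d+)\s+\|", line)
        if head:
            current = head.group(2)
            owners[int(head.group(1))] = current
        elif cont and current is not None:
            owners[int(cont.group(1))] = current
    return owners


owners = production_owners(LISTING)
for number, name in sorted(owners.items()):
    print(number, name)
```

Productions 3 through 8 all map to expr, which is why a production number such as the value of yyn cannot by itself serve as a node-type identifier.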

No enums are produced for the non-terminal symbol numbers nor for the production numbers. The bison parser doesn't require such enums, and it is not at all clear what a user program might choose to do with them, particularly the production numbers.


I think you're approaching this problem the wrong way. If I understand correctly what you are trying to do, you want to construct a specific function for each production, which will be implemented in Rust and also requires a C prototype. That seems like a simple code generation problem, which you could solve starting from the grammar itself.

It's not complicated to extract the grammar from the original bison source, but if you don't want to go to the trouble of doing so, you can easily use the listing from the .output file as shown above, which can be parsed with pretty well any simple text-processing tool like awk.
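As a rough illustration of that kind of text processing, the following Python sketch (standing in for awk) reads the non-terminal section of the listing and renders a C header containing an enum, which a tool such as rust-bindgen could then consume. The enum name CstNodeType, the CST_ prefix, the include guard, and the function names are all hypothetical, and the pattern assumes the listing layout shown above, which may vary between bison versions. The enum values are simply the numbers printed in the listing, so the header would need to be regenerated whenever the grammar changes.

```python
import re

# Non-terminal section copied from the listing above.
NONTERMINALS = """\
Nonterminals, with rules where they appear

$accept (11)
    on left: 0
prog (12)
    on left: 1 2, on right: 0 2
expr (13)
    on left: 3 4 5 6 7 8, on right: 1 2 4 5 6 7 8
"""


def nonterminal_numbers(listing):
    """Return (name, number) pairs for the user-defined non-terminals."""
    symbols = []
    for line in listing.splitlines():
        # Matches lines such as "expr (13)"; internal symbols such as
        # "$accept" are skipped because they do not start with a letter.
        m = re.fullmatch(r"\s*([A-Za-z_][A-Za-z0-9_.-]*)\s+\((\d+)\)", line)
        if m:
            symbols.append((m.group(1), int(m.group(2))))
    return symbols


def render_header(symbols, enum_name="CstNodeType", prefix="CST_"):
    """Render a C header with one enumerator per non-terminal."""
    lines = [
        "/* Generated from the bison .output listing; do not edit. */",
        "#ifndef CST_NODE_TYPES_H",
        "#define CST_NODE_TYPES_H",
        "",
        f"enum {enum_name} {{",
    ]
    for name, number in symbols:
        # Bison allows '.' and '-' in symbol names; C identifiers do not.
        ident = prefix + re.sub(r"\W", "_", name).upper()
        lines.append(f"    {ident} = {number},")
    lines += ["};", "", "#endif /* CST_NODE_TYPES_H */"]
    return "\n".join(lines) + "\n"


header = render_header(nonterminal_numbers(NONTERMINALS))
print(header)
```

With a header like this, each action could pass the appropriate constant (for example CST_EXPR) to createCstNode() explicitly instead of relying on yyn.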

With a bit of work, and making some use of undocumented features, you can also pull the grammar out of bison's debugging data structures.