0
votes

I am new to Flex/Bison. I am trying to write a parser for a simple programming language with support for generic types.

I would like to parse a line like this :

fn foo(Vector<Pair<int, Array<T>>) -> void {}

I can imagine how to write a hand-made parser for Vector<Pair<int, Array<T>>. I would just keep track of the number of < that I encounter and match that with the number of > that I encounter to determine if the type specification is complete.

For the type, the grammar specification would be something like this I believe?

TYPE : ID | ID '<' TYPE '>'
     ;

I am not sure if TYPE is a token that is produced by Flex or by Bison.

My understanding is:

  • ID is a token from the Parser (Flex)
  • TYPE is a `Term' (not a Token) defined in Bison.
  • Flex and Bison automatically ignore whitespace and tabs.

Am I going in the right direction?

NOTE: This project is just for my education purposes. Not a homework assignment etc.

1

1 Answers

0
votes

You are in the right direction, but it is useful to learn some of the correct terminology. Knowing the terminology will help you understand and use the text books which contain much more helpful information.

In the rule TYPE the name ID would be known as a terminal symbol, whereas TYPE would be called a non-terminal symbol of the grammar. A grammar is the set of rules used to describe the language. Each grammar rule defines a non-terminal symbol. Ultimately each non-terminal will be described as an arrangement of terminal symbols in a unique (non ambiguous) arrangement.

A terminal symbol is represented by a token which stands for whatever is its concrete representation. A concrete representation is what is actually typed on the keyboard. The sequence of characters that make up the concrete representation is known as a lexeme. The matching of the lexemes to create the tokens is the task performed by the lexical analyser (or scanner). The matching of the sequence of tokens to the non-terminal in the grammar rules is known as parsing.

Flex is the tool that produces the lexical analyser and bison is the tool that produces the parser.

So, TYPE is not a token but a non-terminal symbol (otherwise known a grammar rule name). It is not "produced" by flex or bison, but by the writer of the grammar. The parser produced by bison will reduce the sequences of terminals and non-terminals to the non-terminal called TYPE.

To avoid confusion between tokens, terminals, non-terminals and rules there is an unwritten convention that tokens or terminal symbols are written either in CAPS or as character constants. For example:

ID, '<'

Non-terminals or syntax rules are written in lower case to avoid the confusion with the former. For example:

type, expression

So, an experienced user of bison might write:

type_signature : ID 
     | ID '<' type_signature '>'
     ;

That way the nature of each name is much clearer.

Now to discuss whitespace. No, flex and bison do not ignore (automatically or otherwise) whitespace, tabs, newlines, carriage returns and other non visible characters, but it's complicated (as is your question).

There are the whitespace characters that occur in the concrete representation handled by the lexical analyser, and there are the whitespace characters that occur in the sets of rules that describe the language processing to the tools flex and bison. A language that you wish to process may contain syntactically (or even semantically) significant whitespace characters (an extreme example is the language called WhiteSpace). It is possible to write the parser for that language using flex and bison, so it cannot ignore all whitespace in its input description file or the input langauge! Rather than go into too much details here, it is just worth noting that in certain places whitespace is ignored and in other places it is particularly significant; until one is more experienced you should be careful. Whitespace is more sensitive in the lexer rules file than the parser rules file.