4
votes

I want to parse some C++ code, and as a guide I've been looking at the C lex/yacc definitions here: http://www.lysator.liu.se/c/ANSI-C-grammar-l.html and http://www.lysator.liu.se/c/ANSI-C-grammar-y.html

I understand the specifications of the tokens themselves, but not how they interact. eg. it's OK to have an operator such as = directly follow an identifier without intervening white space (ie. "foo="), but it's not OK to have a numerical constant immediately followed by an identifier (ie. 123foo). However, I don't see any way that such rules are represented.

What am I missing?... or is this lex/yacc too liberal in its acceptance of errors.

4
You had me at "I want to parse some C++ code".j_random_hacker

4 Answers

3
votes

The lexer converts a character stream into a token stream (I think that's what you mean by token specification). The grammar specifies what sequences of tokens are acceptable. Hence, you won't see that something is not allowed; you only see what is allowed. Does that make sense?

EDIT

If the point is to get the lexer to distinguish the sequence "123foo" from the sequence "123 foo" one way is to add a specification for "123foo". Another way is to treat spaces as significant.

EDIT2

A syntax error can be "detected" from the lexer or the grammar production or the later stages of the compiler (think of, say, type errors, which are still "syntax errors"). Which part of the whole compilation process detects which error is largely a design issue (as it affects the quality of error messages), I think. In the given example, it probably makes more sense to outlaw "123foo" via a tokenization to an invalid token rather than relying on the non-existence of a production with a numeric literal followed by an identifier (at least, this is the behaviour of gcc).

1
votes

The lexer is fine with 123foo and will split that into two tokens.

  • An integer constant
  • and an identifier.

But try and find the part in the syntax that allows those two tokens to sit side by side like that. Thus I bet the lexer is generating an error when it sees these two tokens.

Note the lexer does not care about whitespace (unless you explicitly tell it tow worry). In this case it just throws white space away:

[ \t\v\n\f]     { count(); } // Throw away white space without looking.

Just to check this is what I built:

wget http://www.lysator.liu.se/c/ANSI-C-grammar-l.html > l.l
wget http://www.lysator.liu.se/c/ANSI-C-grammar-y.html > y.y

Edited file l.l to stop in the compiler complaining about undeclared functions:

#include "y.tab.h"

// Add the following lines
int  yywrap();
void count();
void comment();
void count();
int  check_type();
// Done adding lines

%}

Create the following file: main.c:

#include <stdio.h>

extern int yylex();

int main()
{
    int x;
    while((x = yylex()) != 0)
    {
        fprintf(stdout, "Token(%d)\n", x);
    }
}

Build it:

$ bison -d y.y
y.y: conflicts: 1 shift/reduce
$ flex l.l
$ gcc main.c lex.yy.c
$ ./a.out
123foo
123Token(259)
fooToken(258)

Yes it split it into two tokens.

0
votes

what's essentially going on is the lexical rules for each token type are greedy. For instance, the character sequence foo= cannot be interpreted as a single identifier, because identifiers don't contain symbols. on the other hand, 123abc is actually a numerical constant, though malformed, because numerical constants can end with a sequence of alphabetic characters that are used to express the type of the numerical constant.

0
votes

You won't be able to parse C++ with lex and yacc, as it's an ambiguous grammar. You'd need a more powerful approach such as GLR or some hackish solution which modifies a lexer in runtime (that's what most of the current C++ parsers are doing).

Take a look at Elsa/Elkhound.