
I'm trying to make a simple parser that takes BNF grammars as input, using flex and bison with C++. I'm getting some errors when I build it. I have searched other questions with similar errors and corrected my files to match theirs, but I'm still getting the errors.

This is my lex.l

%{
#include <iostream>
#include <string>
#define tkerror -1

#include "sintactic.tab.h"

using namespace std;
extern int row =1;
int col=0;
%}

%option caseful
%option noyywrap
%option yylineno
%option c++

ignora " "|\t|\n
ID [a-zA-Z]([a-zA-Z0-9_])*

%%

{ignora}+       {;}
"terminal"      {col = col + strlen(yylval.cad); strcpy(yylval.cad, yytext); return tkterminal;}
";"             {col = col + strlen(yylval.cad); strcpy(yylval.cad, yytext); return tkptocma;}
","             {col = col + strlen(yylval.cad); strcpy(yylval.cad, yytext); return tkcma;}
"no"            {col = col + strlen(yylval.cad); strcpy(yylval.cad, yytext); return tkno;}
"iniciar"       {col = col + strlen(yylval.cad); strcpy(yylval.cad, yytext); return tkiniciar;}
"con"           {col = col + strlen(yylval.cad); strcpy(yylval.cad, yytext); return tkcon;}
"="             {col = col + strlen(yylval.cad); strcpy(yylval.cad, yytext); return tkasignar;}
".rule"         {col = col + strlen(yylval.cad); strcpy(yylval.cad, yytext); return tkrule;}
"|"             {col = col + strlen(yylval.cad); strcpy(yylval.cad, yytext); return tkor;}
"%"             {col = col + strlen(yylval.cad); strcpy(yylval.cad, yytext); return tksep;}
"EPSILON"       {col = col + strlen(yylval.cad); strcpy(yylval.cad, yytext); return tkeps;}
{ID}            {col = col + strlen(yylval.cad); strcpy(yylval.cad, yytext); return tkid;}
[\r\n]          {row++; col = 0;}
.           {return tkerror;}

%%

And this is my sintactic.y

%{
#include <iostream>
#include <string>
#include "lex.yy.cc"
using namespace std;

extern int row;
extern int yylineno;
extern int col;
extern char* yytext;

extern "C" int yylex();

int yyerror(const char* men)
{
   string output = yytext;
   std::cout<<"Error sintactico "<<output<<" linea "<<row<<" columna "<<col<<endl;

   return 0;
}

%}

%union{
    int entero;
    char cad [256];
}

%token<cad> TOK_EMPTY_LINE;

%token<cad> tkterminal;
%token<cad> tkptocma;
%token<cad> tkno;
%token<cad> tkiniciar;
%token<cad> tkcon;
%token<cad> tkasignar;
%token<cad> tkrule;
%token<cad> tkor;
%token<cad> tksep;
%token<cad> tkeps;
%token<cad> tkid;
%token<cad> tkcma;

%type<nodo> Lenguaje
%type<nodo> Area_Declaraciones;
%type<nodo> Area_NTInicial;
%type<nodo> Area_Gramatica;
%type<nodo> Lista_Declaraciones;
%type<nodo> Declaracion;
%type<nodo> Dec_Terminal;
%type<nodo> Dec_NoTerminal;
%type<nodo> Ids;
%type<nodo> Producciones;
%type<nodo> Produccion;
%type<nodo> Izquierda;
%type<nodo> Derecha;
%type<nodo> Id_Eps;

%%

Lenguaje: Area_Declaraciones tksep Area_NTInicial tksep Area_Gramatica;

Area_Declaraciones: Lista_Declaraciones;
Lista_Declaraciones: Declaracion Lista_Declaraciones
    | ;
Declaracion: Dec_Terminal
    | Dec_NoTerminal;
Dec_Terminal: tkterminal Ids tkptocma;
Dec_NoTerminal: tkno tkterminal Ids tkptocma;

Ids: tkid tkcma Ids
    | tkid;

Area_NTInicial: tkiniciar tkcon tkid tkptocma;

Area_Gramatica: Producciones;
Producciones: Produccion Producciones
    | ;
Produccion: Izquierda tkasignar Derecha tkptocma;
Izquierda: Ids tkrule;
Derecha: Id_Eps Derecha
    | Id_Eps
    | tkor Derecha;
Id_Eps: tkid
    | tkeps;

%%

Compiling through the console using

bison -d sintactic.y
flex lex.l
g++ sintactic.tab.c -lfl -o scanner.sh

The errors that I get are these:

/usr/lib/x86_64-linux-gnu/libfl_pic.a(libmain.o): On the function `main':(.text.startup+0x9): undefined reference to `yylex'

/tmp/ccatti3x.o: On the function `yyerror(char const*)': sintactic.tab.c:(.text+0x23): undefined reference to `yytext[abi:cxx11]'

/tmp/ccatti3x.o: On the function `yyparse()': sintactic.tab.c:(.text+0x409): undefined reference to `yylex()'

collect2: error: ld returned 1 exit status

I will add the actions to sintactic.y once I can compile these files without errors. I've seen other examples with almost the same code as mine, and they seem to compile fine. I'm not used to C++ or flex/bison, so I don't really know where these errors come from.

-o scanner.sh Seriously? - user0042
Flex produces the lexer in a file that defaults to the name lex.yy.c. You'll need to compile that and link it with your scanner. - Jerry Coffin
yytext is a char*, not a std::string. - rici
user0042, would you explain why the use of -o scanner.sh is incorrect? - Diego Cruz
If I use the command g++ sintactic.tab.c lex.yy.c to link the two files, it throws a lot of errors about multiple declarations, since sintactic.tab.c has #include lex.yy.c. Removing that line and linking the files via the command produces the same errors I posted before. - Diego Cruz

1 Answer

  1. Remove %option c++. It's not helping you at all. You can use C++ code, including C++ datatypes, except for the usual limitations on union members (which means that you cannot use anything with a non-trivial destructor as part of the semantic type union.) Using this option causes flex to generate a completely different API, which does not include the global function yylex. You can use this API if you want to -- it's documented in the flex manual -- but you won't find a lot of examples. Much easier is to use the standard C interface, which is, in fact, what you're trying to use.

  2. Change the declaration of yylex in the bison file from

    extern "C" int yylex();
    

    to

    int yylex();
    

    Declaring it as C changes the way its name is represented internally; if you declare a function as extern "C" in some C++ file, you must do so in all of them, including the one in which it is defined (in this case, the lexical scanner).

    You could add the declaration to the flex file by defining the YY_DECL macro but there is no point in this case. It is easier to let it be compiled as a C++ function.

  3. The extern in extern int row = 1; is not necessary. If you define int row = 1; at file level, it has external linkage, i.e., you can reference it from a different "translation unit" (source file). You would normally use extern to declare that an identifier is defined in a different translation unit; in that case, you don't initialize it, since it is initialized in the source file in which it is defined.

  4. Do not #include "lex.yy.cc" in your bison file. I've seen professors who recommend that practice in order to avoid teaching their students about multiple translation units in a C (or C++) project, but that's a dead end. Learn how to do separate compilation; it will be described in any good C textbook. (Or just put the names of both the bison- and flex-generated C files in the compilation line.) If you take my advice in point 1, the default name of the flex-generated scanner will be lex.yy.c, but don't rely on that: use the flex -o option to give a meaningful name to the generated file (with a .cc extension if you want to compile it as C++).

  5. Using yytext in your bison actions is not usually a good idea, but if you are going to use it (for example, in an error report), make sure you declare it as char*.

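  6. Don't use strlen to measure the token in your flex actions. (Your actions call strlen(yylval.cad) before the strcpy, so they actually measure the previous token.) Flex helpfully puts the length of the current token in yyleng, so you don't have to rescan to figure out how long it is.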

  7. Large fixed-length buffers in your semantic type (char cad[256]) are really not a good idea. (And small fixed-length buffers are even worse.) First, they blow up the size of the parser stack, since every stack slot has to have room for the buffer. Also, the entire buffer will be copied every time you copy a stack slot, which is a waste of cycles. Most importantly, fixed-length buffers are just asking for trouble: see buffer overrun. Dynamically allocating memory to hold a copy of the token is easy. (strdup(yytext) is sufficient on most modern systems. Although strdup is not required by the standard C library, it is required by POSIX.) The tricky part is knowing when to free() the allocated memory, but it should be obvious when you're writing your actions.

  8. Save yourself a lot of thinking about dynamic allocation and freeing of strings by not saving the names of tokens such as "iniciar". You know that tkiniciar corresponds to the string "iniciar", but your parser will never consult the semantic value of the token. You should only need semantic values for identifier tokens and literal constants (and in the case of numeric constants, you could use strtol or similar to produce an integer instead of passing the string to bison.)

  9. In flex, you can write [[:space:]] if you want to match a single whitespace character. (In addition to space, newline and tab, it will also match \f and \v, but that shouldn't be a problem.) You can also use [[:alpha:]], [[:digit:]] and [[:alnum:]]. (These are actual character classes, so you can add more things to the class. For example, [[:alnum:]_] is a letter, a digit, or an underscore.) You don't need to parenthesize character classes in order to repeat them; an identifier pattern could be [[:alpha:]_][[:alnum:]_]*. See the flex manual section on patterns for more details.

  10. Remove #define tkerror -1. Returning a negative integer from yylex to a bison parser is Undefined Behaviour, and will not do what you want. Instead, declare tkerror to be a %token (with no semantic type), so that yylex will be able to return it. (The result will be a syntax error because no production uses the tkerror token.)

    How you name your tokens and non-terminals is, of course, completely up to you, but normal style is to use ALL_CAPS for tokens and lower_case for non-terminals. (Some people like to capitalize, as you do, but most of us use lower-case.) Since tokens (but not non-terminals) end up being identifiers in the generated code, they must conform to C naming rules (or C++ rules, as appropriate), and they cannot collide with other identifier names generated by the code (which includes, for example, start conditions in the flex file, and flex macros like BEGIN). So it is sometimes useful to prefix token names with something like TOK_; in particular, TOK_BEGIN and TOK_END are pretty common. TOK_EMPTY_LINE seems unnecessary to me, and in any case your flex file never generates a token with that name, so you might as well just eliminate it as well.
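
    As a rough illustration of point 6, the column bookkeeping can be driven by the token length that flex already provides in yyleng. The sketch below is plain C++ rather than flex input. Position and advance are hypothetical names that appear nowhere in the question; in a real scanner the arguments would be yytext and yyleng.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical position record; the question keeps these values
// in the globals row and col.
struct Position {
    int row = 1;
    int col = 0;
};

// Advance the position over one matched token.
// `length` plays the role of yyleng, so the text never has to be
// measured with strlen; it is walked once only to spot newlines.
void advance(Position& pos, const char* text, std::size_t length) {
    assert(text != nullptr);
    for (std::size_t i = 0; i < length; ++i) {
        if (text[i] == '\n') {
            ++pos.row;    // a newline starts a new row
            pos.col = 0;  // and resets the column
        } else {
            ++pos.col;
        }
    }
}
```

    For tokens that cannot contain a newline, such as keywords and identifiers, the loop is unnecessary and adding the length to the column is enough. The loop is only there so that the same helper could also be called from a whitespace rule.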
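
    The trade-off described in point 7 can be sketched in a few lines of C++. FixedValue and PointerValue are hypothetical stand-ins for the %union before and after the change, and copy_token is an illustrative helper that does the same job as strdup without depending on POSIX. None of these names come from bison or flex.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <cstring>

// Stand-in for the semantic type in the question:
// every parser stack slot has to hold the whole buffer.
union FixedValue {
    int entero;
    char cad[256];
};

// Stand-in for the suggested alternative:
// every parser stack slot holds a single pointer.
union PointerValue {
    int entero;
    char* cad;
};

// Copy `length` characters of a token to the heap and terminate the copy.
// Returns nullptr if allocation fails; the caller must free() the result.
char* copy_token(const char* text, std::size_t length) {
    assert(text != nullptr);
    char* copy = static_cast<char*>(std::malloc(length + 1));
    if (copy == nullptr) {
        return nullptr;
    }
    std::memcpy(copy, text, length);
    copy[length] = '\0';
    return copy;
}
```

    In a scanner action this would presumably be used as yylval.cad = copy_token(yytext, yyleng); and the matching free() would go in whichever parser action consumes the identifier last. The copy stays valid after the scanner overwrites its own buffer with the next token, which is the reason for making it.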