flex and bison: parse string without quotes

Question

I'm working on a mgf file parser (syntax: http://www.matrixscience.com/help/data_file_help.html) using flex + bison c + +.

I've realized the lexer (lex) and parser (yacc). But I've a problem that I can't solve : when I try to parse strings.

Important : there is no ' or " around the string.

Here is an example of input:

CHARGE=1+, 2+ and 3+
#some comments

BEGIN IONS
TITLE= Cmpd 1, +MSn(417.2108), 10.0 min  //line 20
PEPMASS=417.21083   35173
CHARGE=3+
123.79550   20  
285.16455   56  
302.14335   146 1+
[other datas ...]
END IONS

BEGIN IONS
[an other one ... ]

Here the (minimal) lexer: MGF_TOKEN_DEBUG is juste a macro to print a line

#define MGF_TOKEN_DEBUG(val) std::cout<<"token: "<<val<<std::endl

\n {
    MGF_TOKEN_DEBUG("T_EOL");
    return token::T_EOL;
}

^[#;!/][^\n]* {
    MGF_TOKEN_DEBUG("T_COMMENT");
    return token::T_COMMENT;
}

[[:space:]] {}

/** values **/
[0-9]+ {
    MGF_TOKEN_DEBUG("V_INTEGER"<<" (="<<yytext<<")");
    return token::V_INTEGER;
}

[0-9]+"."[0-9]* {
   MGF_TOKEN_DEBUG("V_DOUBLE"<<" (="<<yytext<<")");
   return token::V_DOUBLE;
}

[0-9]+("."[0-9]+)?[eE][+-][0-9]+ {
    MGF_TOKEN_DEBUG("V_DOUBLE"<<" (="<<yytext<<")");
    return token::V_DOUBLE;
}

"+" {
    MGF_TOKEN_DEBUG("T_PLUS");
    return token::T_PLUS;
}


"=" {
    MGF_TOKEN_DEBUG("T_EQUALS");
    return token::T_EQUALS;
}

"," {
    MGF_TOKEN_DEBUG("T_COMA");
    return token::T_COMA;
}

"and" {
    MGF_TOKEN_DEBUG("T_AND");
    return token::T_AND;
}
/*** keywords */
^"CHARGE" {
    MGF_TOKEN_DEBUG("K_CHARGE");
    return token::K_CHARGE;
}

^"TITLE" {
    MGF_TOKEN_DEBUG("K_TITLE");
    return token::K_TITLE;
}
[ others keywords ...]

/**** string : problem here **/
[A-Za-z]([:;,()A-Za-z0-9_.-]|[[:space]])* {
    MGF_TOKEN_DEBUG("V_STRING"<<" (="<<yytext<<")");
    return token::V_STRING;
}

And the (minimized) parser.

start : headerparams blocks T_END;

headerparams : /* empty */| headerparams headerparam;

headerparam : K_CHARGE T_EQUALS charge_list T_EOL | [others ...];

blocks : /* empty */ | blocks block;

block : T_BEGIN_IONS T_EOL blockparams ions T_END_IONS T_EOL| T_BEGIN_IONS T_EOL blockparams T_END_IONS T_EOL;

blockparam  : K_CHARGE T_EQUALS charge T_EOL | K_TITLE T_EQUALS V_STRING T_EOL | [others...];

ion : number number  T_EOL| number number charge T_EOL;

ions : ions ion| ion;

number : V_INTEGER | V_DOUBLE;

charge : V_INTEGER T_PLUS | V_INTEGER T_MINUS;

charge_list : charge| charge_list T_COMA charge | charge_list T_AND charge;

My problem is that I get the next token:

[...]
[line 20]
token: K_TITLE
token: T_EQUALS
token: v_STRING (= Cmpd)
token: V_INTEGER (= 1)
Error line 20: syntax error, unexpected integer, expecting end of line

I would like to have:

[...]
[line 20]
token: K_TITLE
token: T_EQUALS
token: v_STRING (Cmpd 1, +MSn (417.2108), 10.0 min)
token: T_EOL

If someone can help me ...

Edit #1 I've "solve" the problem using the concatenation of tokens:

lex:

[A-Za-z][^\n[:space:]+-=,]* {
    MGF_TOKEN_DEBUG("V_STRING"<<" (="<<yytext<<")"))
    return token::V_STRING;t
}

yacc:

   string_st : V_STRING
      | string_st V_STRING
      | string_st number
      | string_st T_COMA
      | string_st T_PLUS
      | string_st T_MINUS
      ;

blockparam  : K_CHARGE T_EQUALS charge T_EOL | K_TITLE T_EQUALS string_st T_EOL | [others...];

The appropriate tag is flex-lexer, as flex is something Adobe related. — crashmstr

unbound unbound · Accepted Answer · 2013-11-15T14:52:06

if your string will alway start with some text TITLE and end with some text \n (new line char)
I would suggest you to use start conditions,

%x IN_TITLE

"TITLE"        { /* return V_STRING of TITILE in c++ code */ BEGIN(IN_TITLE); }
<IN_TITLE>=    { /* return T_EQUALS in c++ code */; }
<IN_TITLE>"\n" { BEGIN(INITIAL); }
<IN_TITLE>.*   { MGF_TOKEN_DEBUG("V_STRING"<<" (="<<yytext<<")");return token::V_STRING; }

%x IN_TITLE defines the IN_TITLE state, and the pattern text TITLE will make it start. Once it's started, \n will have it go back to the initial state (INITIAL is predefined), and every other characters will just be consumed to V_STRING without any particular action.

flex and bison: parse string without quotes

2 Answers