
I'm working on an MGF file parser (syntax: http://www.matrixscience.com/help/data_file_help.html) using flex + bison with C++.

I've written the lexer (lex) and the parser (yacc), but I have a problem I can't solve when I try to parse strings.

Important: there are no ' or " delimiters around the string.

Here is an example of input:

CHARGE=1+, 2+ and 3+
#some comments

BEGIN IONS
TITLE= Cmpd 1, +MSn(417.2108), 10.0 min  //line 20
PEPMASS=417.21083   35173
CHARGE=3+
123.79550   20  
285.16455   56  
302.14335   146 1+
[other data ...]
END IONS

BEGIN IONS
[another one ...]

Here is the (minimal) lexer. MGF_TOKEN_DEBUG is just a macro that prints a line:

#define MGF_TOKEN_DEBUG(val) std::cout<<"token: "<<val<<std::endl

\n {
    MGF_TOKEN_DEBUG("T_EOL");
    return token::T_EOL;
}

^[#;!/][^\n]* {
    MGF_TOKEN_DEBUG("T_COMMENT");
    return token::T_COMMENT;
}

[[:space:]] {}

/** values **/
[0-9]+ {
    MGF_TOKEN_DEBUG("V_INTEGER"<<" (="<<yytext<<")");
    return token::V_INTEGER;
}

[0-9]+"."[0-9]* {
   MGF_TOKEN_DEBUG("V_DOUBLE"<<" (="<<yytext<<")");
   return token::V_DOUBLE;
}

[0-9]+("."[0-9]+)?[eE][+-][0-9]+ {
    MGF_TOKEN_DEBUG("V_DOUBLE"<<" (="<<yytext<<")");
    return token::V_DOUBLE;
}

"+" {
    MGF_TOKEN_DEBUG("T_PLUS");
    return token::T_PLUS;
}


"=" {
    MGF_TOKEN_DEBUG("T_EQUALS");
    return token::T_EQUALS;
}

"," {
    MGF_TOKEN_DEBUG("T_COMA");
    return token::T_COMA;
}

"and" {
    MGF_TOKEN_DEBUG("T_AND");
    return token::T_AND;
}
/*** keywords */
^"CHARGE" {
    MGF_TOKEN_DEBUG("K_CHARGE");
    return token::K_CHARGE;
}

^"TITLE" {
    MGF_TOKEN_DEBUG("K_TITLE");
    return token::K_TITLE;
}
[other keywords ...]

/**** string : problem here **/
[A-Za-z]([:;,()A-Za-z0-9_.-]|[[:space]])* {
    MGF_TOKEN_DEBUG("V_STRING"<<" (="<<yytext<<")");
    return token::V_STRING;
}

And the (minimized) parser.

start : headerparams blocks T_END;

headerparams : /* empty */| headerparams headerparam;

headerparam : K_CHARGE T_EQUALS charge_list T_EOL | [others ...];

blocks : /* empty */ | blocks block;

block : T_BEGIN_IONS T_EOL blockparams ions T_END_IONS T_EOL| T_BEGIN_IONS T_EOL blockparams T_END_IONS T_EOL;

blockparam  : K_CHARGE T_EQUALS charge T_EOL | K_TITLE T_EQUALS V_STRING T_EOL | [others...];

ion : number number  T_EOL| number number charge T_EOL;

ions : ions ion| ion;

number : V_INTEGER | V_DOUBLE;

charge : V_INTEGER T_PLUS | V_INTEGER T_MINUS;

charge_list : charge| charge_list T_COMA charge | charge_list T_AND charge;

My problem is that I get the following tokens:

[...]
[line 20]
token: K_TITLE
token: T_EQUALS
token: V_STRING (= Cmpd)
token: V_INTEGER (= 1)
Error line 20: syntax error, unexpected integer, expecting end of line

I would like to have:

[...]
[line 20]
token: K_TITLE
token: T_EQUALS
token: V_STRING (= Cmpd 1, +MSn(417.2108), 10.0 min)
token: T_EOL

Any help would be appreciated.


Edit #1: I've "solved" the problem by concatenating tokens:

lex:

[A-Za-z][^\n[:space:]+=,-]* {
    /* '-' is placed last in the class so that "+-=" is not read as a range */
    MGF_TOKEN_DEBUG("V_STRING"<<" (="<<yytext<<")");
    return token::V_STRING;
}

yacc:

   string_st : V_STRING
      | string_st V_STRING
      | string_st number
      | string_st T_COMA
      | string_st T_PLUS
      | string_st T_MINUS
      ;

blockparam  : K_CHARGE T_EQUALS charge T_EOL | K_TITLE T_EQUALS string_st T_EOL | [others...];
Comment (crashmstr): The appropriate tag is flex-lexer, as flex is something Adobe related.

2 Answers

1
votes

If your string always starts after the text TITLE and ends at a \n (newline character), I would suggest using start conditions:

%x IN_TITLE

^"TITLE"                 { BEGIN(IN_TITLE); return token::K_TITLE; }
<IN_TITLE>"="            { return token::T_EQUALS; }
<IN_TITLE>\n             { BEGIN(INITIAL); return token::T_EOL; }
<IN_TITLE>[^=\n][^\n]*   { MGF_TOKEN_DEBUG("V_STRING"<<" (="<<yytext<<")"); return token::V_STRING; }

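%x IN_TITLE declares the exclusive start condition IN_TITLE, and matching the text TITLE enters it. Once in that state, \n switches back to the initial state (INITIAL is predefined), and the rest of the line is returned as a single V_STRING token. Keep in mind that flex prefers the longest match, so a pattern such as .* in that state would also swallow the =; the string pattern therefore must not be allowed to start with =.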
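To make the state switching concrete, here is a small hand-written C++ analogue of the idea. It is only a sketch: the State enum and lex_title_line are hypothetical names, tokens are returned as printable strings rather than bison token codes, and it does not reproduce flex's actual matching algorithm.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-ins for the flex start conditions INITIAL and IN_TITLE.
enum class State { Initial, InTitle };

// Tokenizes the input in the spirit of the start-condition rules and
// returns the token names in the order they would be emitted.
std::vector<std::string> lex_title_line(const std::string& input) {
    std::vector<std::string> tokens;
    State state = State::Initial;
    std::size_t pos = 0;
    while (pos < input.size()) {
        if (state == State::Initial) {
            if (input.compare(pos, 5, "TITLE") == 0) {
                tokens.push_back("K_TITLE");
                pos += 5;
                state = State::InTitle;   // corresponds to BEGIN(IN_TITLE)
            } else {
                ++pos;                    // all other rules are omitted here
            }
        } else if (input[pos] == '=') {
            tokens.push_back("T_EQUALS");
            ++pos;
        } else if (input[pos] == '\n') {
            tokens.push_back("T_EOL");
            ++pos;
            state = State::Initial;       // corresponds to BEGIN(INITIAL)
        } else {
            // Everything up to (not including) the newline is one string.
            std::size_t end = input.find('\n', pos);
            if (end == std::string::npos) end = input.size();
            tokens.push_back("V_STRING(" + input.substr(pos, end - pos) + ")");
            pos = end;
        }
    }
    return tokens;
}
```

Calling lex_title_line on the TITLE line from the question (followed by a newline) yields K_TITLE, T_EQUALS, one V_STRING holding the rest of the line, and T_EOL, which is the token sequence the question asks for.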

1
votes

Your basic problem is a simple typo:

[A-Za-z]([:;,()A-Za-z0-9_.-]|[[:space]])*

should be:

[A-Za-z]([:;,()A-Za-z0-9_.-]|[[:space:]])*
                                     ^

You don't actually need the | operator. The following is perfectly legal (but probably not what you want either; see below):

[A-Za-z][[:space:]:;,()A-Za-z0-9_.-]*

Once you fix that, you'll find that you have another problem: your keywords (TITLE, for example) will be lexed as STRING because the STRING pattern is longer. (In fact, since [:space:] includes \n, the STRING pattern will probably extend to the end of the input. You probably wanted [:blank:].)
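The difference between [:space:] and [:blank:] can be checked outside flex with the matching <cctype> predicates. The following C++ sketch imitates the corrected STRING character class by hand; continues_string and string_match_length are hypothetical helper names, and the default "C" locale is assumed.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// True if c may continue a string under the character class
// [:;,()A-Za-z0-9_.-] plus either [:space:] or [:blank:].
bool continues_string(unsigned char c, bool use_blank) {
    bool ws = use_blank ? (std::isblank(c) != 0) : (std::isspace(c) != 0);
    return ws || std::isalnum(c) != 0 ||
           std::string(":;,()_.-").find(static_cast<char>(c)) != std::string::npos;
}

// Length of the longest prefix of input that the STRING pattern would match.
std::size_t string_match_length(const std::string& input, bool use_blank) {
    if (input.empty() || !std::isalpha(static_cast<unsigned char>(input[0])))
        return 0;
    std::size_t n = 1;
    while (n < input.size() &&
           continues_string(static_cast<unsigned char>(input[n]), use_blank))
        ++n;
    return n;
}
```

With the input "min\nPEPMASS", the [:space:] variant matches all 11 characters because it runs across the newline, while the [:blank:] variant stops after the 3 characters of "min".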

I took a quick glance at the description of the format you're trying to parse, but it's not very precise. It appears that parameter lines have the format:

^[[:alpha:]]+=.*$

Perhaps the :alpha: should be :alnum: or even something more permissive; as I said, the description wasn't very precise. What was clear is that:

  • The keyword is case-insensitive, so both TITLE and title will work identically, and
  • The = sign is obligatory and may not have a space on either side of it. (So your TITLE= line, which has a space after the =, is not strictly correct, but maybe it doesn't matter.)

In order to not interfere with parsing of the data, you might want to make the above a single "token" whose value is the part after the = and whose type corresponds to the (case-normalized) keyword. Of course, each parameter-type may require an idiosyncratic value parser, which could only be achieved in flex by use of start conditions. In any event, you should think about the consequences of stray characters in the TITLE which are not part of the STRING pattern, and how you propose to deal with the resulting lexical error.
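As an illustration of the "single token per parameter line" idea, here is a C++ sketch that splits a line at the first = and normalizes the keyword to upper case. The Param struct and parse_param_line are hypothetical names, and the keyword is assumed to be purely alphabetic, as in the pattern above; the real format may be more permissive.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical result type: case-normalized keyword plus the raw value.
struct Param {
    std::string key;
    std::string value;
};

// Returns true and fills out if line looks like KEY=value with an
// alphabetic KEY; returns false for data lines such as "123.79550   20".
bool parse_param_line(const std::string& line, Param& out) {
    std::size_t eq = line.find('=');
    if (eq == std::string::npos || eq == 0) return false;
    for (std::size_t i = 0; i < eq; ++i) {
        if (!std::isalpha(static_cast<unsigned char>(line[i]))) return false;
    }
    out.key = line.substr(0, eq);
    out.value = line.substr(eq + 1);   // everything after '=', untouched
    std::transform(out.key.begin(), out.key.end(), out.key.begin(),
                   [](unsigned char c) { return static_cast<char>(std::toupper(c)); });
    return true;
}
```

Because the value is kept verbatim, stray characters in a TITLE cannot cause a lexical error; each parameter type can then apply its own value parser to the value field.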


Your code does not make it clear how you communicate text values from your lexer to your parser. You need to be aware that the value of yytext is only safe inside of the lexer action for the token it corresponds to. The next call to the lexer will invalidate it, and bison parsers almost always have a lookahead token, so the lexer will have been called again before the token is processed. Consequently, you must copy yytext in order to pass it to the parser, and the parser needs to take ownership of the copy so that you don't end up leaking memory.
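The lifetime issue can be demonstrated without flex. In this C++ sketch, fake_yytext, fake_scan, Token, and make_token are all hypothetical stand-ins: the buffer plays the role of yytext, and fake_scan plays the role of a lexer call that overwrites it. Copying into a std::string is one way to give the token ownership of its text; with a C-style semantic value you would copy with something like strdup and free it in the parser instead.

```cpp
#include <cstring>
#include <string>

// Hypothetical stand-in for yytext: one buffer reused by every scan.
static char fake_yytext[64];

// Pretends to scan a token by overwriting the shared buffer, the way the
// next lexer call invalidates the previous contents of yytext.
void fake_scan(const char* lexeme) {
    std::strncpy(fake_yytext, lexeme, sizeof fake_yytext - 1);
    fake_yytext[sizeof fake_yytext - 1] = '\0';
}

// A token that owns a copy of its text instead of pointing into the buffer.
struct Token {
    std::string text;
};

// Must be called inside the "lexer action", while the buffer is still valid.
Token make_token() {
    return Token{fake_yytext};
}
```

If you scan "Cmpd", build a Token, and then scan "1" (as a lookahead would), the Token still holds "Cmpd", whereas a raw pointer to the buffer saved before the second scan now reads "1".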