0
votes

I'm facing a problem with what looks like a simple grammar:

grammar Test;

init :
    init separator init
    | word;

word :
    ( LETTER )+ ;

separator :
    SPACE OPERATOR SPACE
    | SPACE ;

SPACE : ' '+ ;
LETTER : 'A'..'Z' ;
OPERATOR : 'AND' | 'OR' ;

WS : [\t\r\n]+ -> skip ; // skip tabs and newlines (spaces are matched by SPACE)

If I input the string AOR OR B, I get the error line 1:1 extraneous input 'OR' expecting {<EOF>, SPACE, LETTER}, but I don't understand why. Shouldn't word match any capital letters until it finds a space character?

The result I expect is the word AOR, the OR operator and the word B.

Can anyone give me some tips? Thank you in advance!

2 Answers

1
votes

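In your case, the input AOR OR B gets tokenized as follows:

  1. type=LETTER, text=A
  2. type=OPERATOR, text=OR
  3. type=SPACE, text=' '
  4. type=OPERATOR, text=OR
  5. type=SPACE, text=' '
  6. type=LETTER, text=B

The lexer runs before the parser and always takes the longest match, so after the A it prefers OPERATOR ('OR', two characters) over LETTER ('O', one character). The parser rule word has no influence on that decision.

If you want AOR to be tokenized as a single word, you should make it a lexer rule instead of a parser rule. Declare it after OPERATOR, because when two lexer rules match the same text ANTLR picks the one defined first, and the input OR should still become an OPERATOR: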

WORD : 'A'..'Z'+ ;
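
For reference, here is a minimal sketch of how the whole grammar might look with that change. It assumes ANTLR 4 and keeps the parser rules from the question as they are, apart from word; the placement of OPERATOR before WORD is my assumption about how you want a bare OR or AND to be treated, not something the question states.

```antlr
grammar Test;

// Parser rules: unchanged from the question, except that word
// now refers to a single WORD token instead of a run of LETTERs.
init :
    init separator init
    | word ;

word :
    WORD ;

separator :
    SPACE OPERATOR SPACE
    | SPACE ;

// Lexer rules. For the input OR both OPERATOR and WORD match two
// characters; the rule defined first wins, so OPERATOR comes first.
// For AOR only WORD matches all three characters, so the longest
// match makes it a single WORD.
OPERATOR : 'AND' | 'OR' ;
WORD     : 'A'..'Z'+ ;
SPACE    : ' '+ ;

WS : [\t\r\n]+ -> skip ; // skip tabs and newlines
```

One consequence of this ordering is that a word consisting of exactly AND or OR can never be a WORD. If you need that, the distinction has to be made in the parser instead.
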
0
votes
  • Don't mix lexical and syntax analysis. word should be a token; defined as a parser rule, it allows skipped WS characters (tabs, newlines) to appear inside a word.

  • Why is ' ' treated differently from \t, \r and \n? Does it have a special meaning in your grammar? If you define WS as [ \t\r\n] -> skip, your tokens will be separated by those characters, and the characters themselves will be ignored.

  • Use an unambiguous grammar. Parser generators may resolve ambiguities, but the correctness of the result depends on the grammar and the tool you use, so you must know how the generator resolves them.

    init : init separator init | word ;
    

    can be equivalently and unambiguously expressed as

    init : word init2;
    init2 : separator word init2 | ;
    

    or

    init : word (separator word)* ;
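
The three points above could be combined roughly as in the sketch below. It assumes ANTLR 4 and assumes that spaces carry no meaning beyond separating tokens, so the separator rule from the question collapses into an optional OPERATOR. The names WORD and OPERATOR follow the thread; the exact shape of init is only one possible reading of what the question wants.

```antlr
grammar Test;

// A word, followed by any number of further words, each optionally
// preceded by an operator. No recursion, so no ambiguity to resolve.
init : WORD (OPERATOR? WORD)* ;

// OPERATOR is listed before WORD so that a bare AND or OR is an
// operator; longer runs such as AOR still match WORD (longest match).
OPERATOR : 'AND' | 'OR' ;
WORD     : [A-Z]+ ;

// All whitespace, including ' ', only separates tokens and is dropped.
WS : [ \t\r\n]+ -> skip ;
```

With this version AOR OR B should lex as WORD, OPERATOR, WORD, and A B as two WORD tokens, because the skipped whitespace still ends the token before it.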