
The initial title question was: Why does my lexer rule not work, until I change it to a parser rule? The contents below are related to this question. Then I found new information and changed the title question. Please see my comment!

My ANTLR grammar (only the "Spaces" rule and its use are important):

grammar MyTest;

Space:        ' ';
Tab:        '\t';
Break:         '\n';
Digit:        [0-9];
Char:        [A-Z\u00C4\u00D6\u00DCa-z\u00E4\u00F6\u00FC\u00DF];
Prefix:        '"' | '\'' | '(' | '[';
Suffix:        '\u00AF' | '\u002d' | '.' | ',' | ':' | ';' | '!' | '?' | '"' | '\'' | ')' | ']';
Special:    [\u005e\u00ac\u2014\u201e\u2022/><§&{}#*~+\\];

Spaces:        Space (Space Space?)?;
Sign: Prefix | Suffix | Special ;

LatinNumber
    : 'I' ('I' 'I'?)?  
    | 'I'? 'V' ('I' ('I' 'I'?)?)?  
    | 'I'? 'X' ('I' ('I' 'I'?)?)? 'V'? ('I' ('I' 'I'?)?)? ;
YearNumber
    : '(' '1' '9' Digit Digit ')'
    | '[' '1' '9' Digit Digit ']'
    | '1' '9' Digit Digit;
OtherNumber
    : [1-9] Digit* ;

Numbers
    : LatinNumber | YearNumber | OtherNumber;
NormalNumbers
    : Prefix? Numbers Suffix?;  

Word: Prefix? Char Char+ Suffix?;

line: Break Spaces? ((Word | NormalNumbers) Spaces?)+ ;

myTest: line ;

Example Input:

Something- and Somethingmore at location

Located Somewhere

Dallas, 2012

at. 99.2013(2014)

Some bla blub Text- and Content Examples from Wikipedia The Illinois Centennial half dollar is a commemorative fifty-cent piece struck by the United States Bureau of the Mint in 1918. The obverse side, depicting Abraham Lincoln, was designed by Chief Engraver George T. Morgan; the reverse image, based on the Seal of Illinois, was done by his assistant and successor, John R. Sinnock.

https://en.wikipedia.org/wiki/Illinois_Centennial_half_dollar

Console Output

line 2:10 extraneous input ' ' expecting {<EOF>, NormalNumbers, Word}
ParseTree:
(myTest (line \n Something-   and))

Improved ParseTree:
'- myTest
 |- TOKEN[type: 3, text: \n]
 |- TOKEN[type: 16, text: Something-]
 |- TOKEN[type: 1, text:  ]
 '- TOKEN[type: 16, text: and]

So the output states that there is a problem right after the first "Something-" of my input, where the whitespace occurs (in my grammar it is just called Space). Because my input comes from an OCR source, there can be multiple consecutive spaces; on the other hand, I need to recognize the spaces, because they carry meaning for the text structure. For this reason I defined the following in my grammar:

Spaces:        Space (Space Space?)?;

but this throws the error above: the whitespace is not recognized. When I replace it with a parser rule (lowercase!) in my grammar

spaces:        Space (Space Space?)?;

and also here

line: Break spaces? ((Word | NormalNumbers) spaces?)+ ;

the error seems to be solved (subsequent errors appear - not part of this question).

So why does using a parser rule instead of a lexer rule solve the error in this concrete case? And in general: when should I use a lexer rule and when a parser rule?

Thank you, guys!

When I completely reverse the order of my grammar rules, it seems to work even if I am using a lexer rule for "Spaces". This led me to the new title question you can see above. So, when I define all parser rules first, I can define a lexer rule for "Spaces" below them and the parser uses it as expected, but not vice versa. Why? - dc.
Please don't edit your question after it has been answered. If you have a different question, ask a different question. (But in this case, I think my answer basically stands. Do you find the answer unconvincing?) - rici
@rici Thanks, I had not seen your answer before. I guess you hit the point. The reason was a double match, which also explains why it works when completely reversing the rule definitions, because then "Spaces" comes first. So it is a question of order, but not in the way I had expected. My error was trivial; I just had not seen it. Thank you - dc.
I tried to make the answer more explicit. Hope it helps. - rici
Ah OK, according to your updated answer there is still more complexity than I had thought. Apart from the lexical ordering, redefining the lexer rule as a parser rule changes the game, because the parser rule then kind of overrides the lexer rule. But if we stay on the lexer level, the ordering is crucial. - dc.

1 Answer

A single space is being recognized as a Space and not as a Spaces, since it matches both lexical rules and Space comes first in the grammar file. (You can see that token type 1 is being recognized; Spaces would be type 9 by my count.)

ANTLR uses the common "maximum munch" lexical strategy: the token recognized is the one with the longest possible match, and if two or more rules match that same longest text, the rule that appears first in the grammar file wins. When you put Spaces first in the file, it wins the tie. If you make it a parser rule instead of a lexical rule, it is applied after lexing, when the now unambiguous lexical rule Space has already produced the tokens.
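
The tie-breaking can be tried out with a small Python simulation. This is only a sketch of the strategy, not ANTLR's actual implementation: the tokenize function is hypothetical, the rules are regular expressions, and the Word pattern is a simplified stand-in for the Word rule in the question.

```python
import re

def tokenize(text, rules):
    """Maximal munch: at each position take the longest match;
    on a tie, the rule listed first wins."""
    pos, tokens = 0, []
    while pos < len(text):
        best_name, best_len = None, 0
        for name, pattern in rules:
            m = re.compile(pattern).match(text, pos)
            # Strict '>' means a later rule only wins with a longer match,
            # so the earlier rule keeps the token on a tie.
            if m and len(m.group()) > best_len:
                best_name, best_len = name, len(m.group())
        if best_name is None:
            raise ValueError(f"no rule matches at position {pos}")
        tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens

# Simplified stand-ins for the lexer rules in the question (assumptions).
SPACE = ("Space", r" ")
SPACES = ("Spaces", r" {1,3}")       # Space (Space Space?)?
WORD = ("Word", r"[A-Za-z]{2,}-?")   # letters plus an optional '-' suffix

space_first = [SPACE, SPACES, WORD]   # order used in the question
spaces_first = [SPACES, SPACE, WORD]  # Spaces moved above Space

# One space: both rules match one character, so file order decides.
print(tokenize("Something- and", space_first))
print(tokenize("Something- and", spaces_first))

# Two spaces: Spaces matches more text, so it wins regardless of order.
print(tokenize("ab  cd", space_first))
```

With the question's order, the single space comes out as Space, which is the token the parser complains about; with Spaces listed first, the same character comes out as Spaces.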

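Do you really only want to allow up to 3 spaces? Otherwise, you could just ditch Space and define Spaces as ' '+ (ANTLR string literals use single quotes, and ' '* would also match the empty string, which a lexer rule should not do).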