antlr grammar: Lexer matches “impossible” rule

Question

I have this parser grammar, and I also want to support something similar to JavaScript template strings.

parser grammar Test;

options {
  tokenVocab = TestLexer;
}

definition: sourceElements? EOF ;

sourceElements: sourceElement+ ;

sourceElement: mapping ;


templateString: '`' TemplateStringCharacter* ('${' variable '}' TemplateStringCharacter*)+ '`' ;
fieldName: varname | ('[' value ']') ;
mapping: fieldName ':' ( '{' sourceElements '}'
      | variable ( '{' sourceElements '}' )? '?'?
      | value
      | array )
      ;

funParameter: '(' value? (',' value)*  ')' ;
array: '[' value? (',' value)* ']';
variable: (varname | '{' value '}' | '[' boolEx ']' | templateString) funParameter? ('.' variable)* ;
value: INT | BOOL | FLOAT | STRING | variable ;
varname: VAR ;

And this lexer grammar

lexer grammar TestLexer;

WS : [ \t\r\n\u000C]+ -> skip ;
NEWLINE : [\r\n] ;
BOOL : ('true'|'false') ;
TemplateStringLiteral : TemplateStringCharacter*;
VAR : [$]?[a-zA-Z0-9_]+|[@] ;
INT : '-'?[0-9]+ ;
FLOAT : '-'?[0-9]+'.'[0-9]+ ;
STRING : '"' DoubleStringCharacter* '"' | '\'' SingleStringCharacter* '\'' ;
TEMPSTART : '${' ;
TEMPEND : '}' ;

TemplateStart : '`' -> pushMode(template) ;

/// Comments
MultiLineComment : '/*' .*? '*/' -> channel(HIDDEN) ;
SingleLineComment : '//' ~[\r\n\u2028\u2029]* -> channel(HIDDEN) ;

mode template;
TemplateVariableStart: TEMPSTART -> pushMode(templateVariable);
TemplateStringLiteral : TemplateStringCharacter* ;
TemplateEnd : '`' -> popMode;

mode templateVariable;
WS : [ \t\r\n\u000C]+ -> skip ;
All : [^}]+ ;
TemplateVariableEnd : TEMPEND -> popMode;

fragment DoubleStringCharacter : ~["\r\n] ;
fragment SingleStringCharacter : ~['\r\n] ;
fragment TemplateStringCharacter : ~[`] ;
fragment DecimalDigit : [0-9] ;

When I input this:

test: {
  abc: `Hello World`
}

The parse tree looks like this:

(definition 
  (sourceElements 
    (sourceElement 
      (statement 
        (mapping 
          (fieldName 
            (varname test)
          ) : { 
          (sourceElements
            (sourceElement
              (statement mapping)
            ) 
            (sourceElement
              (statement
                (mapping abc : `)
              )
            ) 
            (sourceElement 
              (statement mapping)
            ) 
            (sourceElement 
              (statement 
                (mapping Hello)
              )
            ) 
            (sourceElement 
              (statement
                (mapping World `)
              )
            )
          ) 
          }
        )
      )
    )
  ) 
  <EOF>
)

And I get the error: line 2:8 no viable alternative at input 'abc:`Hello'

I don't understand why it is even possible to match something like an empty mapping or a mapping like "World `", because a mapping needs to have a ":" in the middle. And why doesn't the templateString rule match the whole "Hello World" from backtick to backtick?

EDIT:

After noticing that the lexer had not been regenerated when I thought it had, I got errors like: "cannot create implicit token for string literal in non-combined grammar: ']'". So I had to move all implicit token declarations into the lexer grammar, and I changed the code to this:

parser grammar Test;

options {
  tokenVocab = TestLexer;
}

definition: sourceElements? EOF ;

sourceElements: sourceElement+ ;

sourceElement: mapping ;

templateString: OpenBackTick TemplateStringLiteral* (TemplateVariableStart variable CloseBrace TemplateStringLiteral*)+ CloseBackTick ;
fieldName: varname | OpenBracket value CloseBracket ;
mapping: fieldName Colon (
      OpenBrace sourceElements CloseBrace
      | variable ( OpenBrace sourceElements CloseBrace )? IF?
      | value
      | array
    )
    ;

funParameter: OpenParen value? (Comma value)* CloseParen ;
array: OpenBracket value? (Comma value)* CloseBracket;
variable: (varname | OpenBrace value CloseBrace | templateString) funParameter? (Dot variable)* ;
value: INT | BOOL | FLOAT | STRING | variable ;
varname: VAR ;

And lexer grammar:

lexer grammar TestLexer;

OpenBracket: '[';
CloseBracket: ']';
OpenParen: '(';
CloseParen: ')';
OpenBrace: '{' ;
CloseBrace: '}' ;
IF: '?' ;
AND: 'AND' ;
OR: 'OR';
LessThan: '<';
MoreThan: '>';
LessThanEquals:   '<=';
GreaterThanEquals:   '>=';
Equals: '=';
NotEquals: '!=';
IN: 'IN';
NOT: '!';
Colon: ':';
Dot: '.' ;
Comma: ',' ;
OpenBackTick : '`' -> pushMode(template) ;

WS : [ \t\r\n\u000C]+ -> skip ;
NEWLINE : [\r\n] ;
BOOL : ('true'|'false') ;
VAR : [$]?[a-zA-Z0-9_]+|[@] ;
INT : '-'?[0-9]+ ;
FLOAT : '-'?[0-9]+'.'[0-9]+ ;
STRING : '"' DoubleStringCharacter* '"' | '\'' SingleStringCharacter* '\'' ;

/// Comments
MultiLineComment : '/*' .*? '*/' -> channel(HIDDEN) ;
SingleLineComment : '//' ~[\r\n\u2028\u2029]* -> channel(HIDDEN) ;

mode template;
TemplateVariableStart: '${' -> pushMode(templateVariable);
CloseBackTick : '`' -> popMode;
TemplateStringLiteral: TemplateStringCharacter ;

mode templateVariable;
WHS : [ \t\r\n\u000C]+ -> skip ;
All : [^}]+ ;
TemplateVariableEnd : CloseBrace -> popMode;

fragment DoubleStringCharacter : ~["\r\n] ;
fragment SingleStringCharacter : ~['\r\n] ;
fragment TemplateStringCharacter : ~[`] ;
fragment DecimalDigit : [0-9] ;

Now I get the error: line 1:0 mismatched input 'test' expecting {, '?', '[', VAR}. This is strange, because 'test' should be matched by VAR. Any ideas why this is happening?

This is the point where we really need to look at which tokens the lexer generates for your input. Looking at your lexer grammar, I agree that test should be a VAR, but clearly the lexer does not, so it'd be important to know what the lexer thinks test is. With your old code it would have been a TemplateStringLiteral (except that that would have matched more than just test), but with your current code I don't see anything else that matches. Try to run your lexer with antlr4 TestLexer.g4 && javac *.java && grun TestLexer tokens -tokens or iterate over the token stream in JavaScript. — sepp2k
It works in Java. In JS it still doesn't, and I don't know how to output the same tokens as "grun TestLexer tokens -tokens" does. The documentation for JS is not that good. I tried this.tokenStream = new antlr4.CommonTokenStream(this.lexer); but this.tokenStream.tokens only returned a list like: [ CommonToken { source: [ [UpsatLexer], [InputStream] ], type: 3, channel: 0, start: 0, stop: 3, tokenIndex: 0, line: 1, column: 0, _text: null }, ...]. I tried some other functions, but not all Java functions seem to be present in JS. — Martin Cup
You can get the tokens via lexer.getAllTokens() and then you can print each one properly by calling its toString with the lexer as an argument. See this gist — sepp2k
Have you rebuilt the parser after rebuilding the lexer? Maybe the parser has a different idea of which numbers correspond to which tokens because it hasn't been updated since the last time you changed the lexer? That's the only thing I can think of if 26 is the number for VAR. — sepp2k
Yeah, the part about passing the lexer was nonsense (you can get rule names for parse trees by passing the parser to the tree's toString, but that doesn't work for tokens). You can use lexer.ruleNames[tok.type] to actually get the token type as a string, but if the numbers match, that will just print VAR as well. — sepp2k

sepp2k · Accepted Answer · 2019-06-24T17:31:56

There are two lexer rules in your default mode that can match a backtick: BTICK and TemplateStart. TemplateStart will switch to the template mode, but BTICK will not. When two lexer rules match the same amount of input, ANTLR picks the one that is defined first, and since BTICK comes first in your grammar, it takes precedence. That means when the lexer sees a backtick, it will generate a BTICK token and not switch modes.

To fix this you should have only one lexer rule per mode that matches a backtick and that rule should change the mode.
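
As a rough sketch of that rule (not the asker's full grammar), the lexer below keeps exactly one backtick rule per mode, and each of them changes the mode. The grammar name TemplateSketchLexer and the rule TemplateVariableName are made up for illustration; the remaining rule names are borrowed from the question.

```antlr
// Hypothetical, cut-down lexer: only the rules needed to show the mode switching.
lexer grammar TemplateSketchLexer;

// Default mode: the ONLY rule that matches a backtick, and it enters template mode.
OpenBackTick : '`' -> pushMode(template) ;
Colon        : ':' ;
VAR          : [$]? [a-zA-Z0-9_]+ ;
WS           : [ \t\r\n]+ -> skip ;

mode template;
// Again a single backtick rule; it leaves template mode.
CloseBackTick         : '`' -> popMode ;
// '${' is two characters, so it wins over the one-character rule below.
TemplateVariableStart : '${' -> pushMode(templateVariable) ;
// Any single character other than a backtick.
TemplateStringLiteral : ~[`] ;

mode templateVariable;
// Hypothetical rules for the inside of ${...}; '}' returns to template mode.
TemplateVariableEnd  : '}' -> popMode ;
TemplateVariableName : [a-zA-Z0-9_]+ ;
```

With a layout like this, the opening backtick can only ever produce OpenBackTick, so the lexer always enters the template mode before it reads the string contents.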

I don't understand why it is even possible to match something like an empty mapping or a mapping like "World `", because a mapping needs to have a ":" in the middle.

When your input contains a syntax error, the generated parse tree can contain constructs that aren't actually valid either. When your input parses without errors, you'll get a tree that makes sense.
