Different results with simple ANTLR grammar using IntelliJ vs command line

Question

I have a simple grammar that seems to work correctly in the IntelliJ ANTLR4 plugin, but when run through ANTLR at the command line it produces some unusual errors.

I have tried searching for similar problems and rearranging the lexer but have had no success.

The lexer is as follows:

lexer grammar kscriptLexer;

TEXT_COMMENT
    :   '//' ~[\r\n]*?
    ;

TAG_COMMENT
    :   '<%//' .*? '%>'
    ;

TAG_OPEN
    :   '<%' //->pushMode(SCRIPT)
    ;

//TEXT
//  :   ~'<'+ ->skip
//  ;



//mode SCRIPT;

TAG_CLOSE
    :   '%>' //->popMode
    ;

IF  :   I F;
ENDIF   :   E N D I F;
ELSE    :   E L S E;
LOOP    :   L O O P;
ENDLOOP :   E N D L O O P;

INT
    :   [0-9]+
    ;

FLOAT
    :   [0-9]+ DOT [0-9]+
    ;

STRING
    :   '"' ~'"'* '"'
    ;

NE  :   '<>';
LE  :   '<=';
GE  :   '>=';
LT  :   '<';
GT  :   '>';
EQ  :   '=';
ASSIGN  :   ':=';
AMPERSAND:  '&';

MUL :   '*';
DIV :   '/';
ADD :   '+';
SUB :   '-';

LBKT    :   '(';
RBKT    :   ')';

COMMA   :   ',';
DOT :   '.';

fragment A: 'a' | 'A';
fragment B: 'b' | 'B';
fragment C: 'c' | 'C';
fragment D: 'd' | 'D';
fragment E: 'e' | 'E';
fragment F: 'f' | 'F';
fragment G: 'g' | 'G';
fragment H: 'h' | 'H';
fragment I: 'i' | 'I';
fragment J: 'j' | 'J';
fragment K: 'k' | 'K';
fragment L: 'l' | 'L';
fragment M: 'm' | 'M';
fragment N: 'n' | 'N';
fragment O: 'o' | 'O';
fragment P: 'p' | 'P';
fragment Q: 'q' | 'Q';
fragment R: 'r' | 'R';
fragment S: 's' | 'S';
fragment T: 't' | 'T';
fragment U: 'u' | 'U';
fragment V: 'v' | 'V';
fragment W: 'w' | 'W';
fragment X: 'x' | 'X';
fragment Y: 'y' | 'Y';
fragment Z: 'z' | 'Z';

ID
    :   [a-zA-Z_] [a-zA-Z_0-9]*
    ;

WS
    :   [ \t\r\n]+ -> channel(HIDDEN)
    ;

The parser grammar is as follows:


tokens {TAG_COMMENT,TEXT_COMMENT,TAG_OPEN,TAG_CLOSE,IF,ENDIF,ELSE,LOOP,ENDLOOP,TEXT,ID,DOT,ASSIGN,LBKT,RBKT,INT,FLOAT,STRING,MUL,DIV,ADD,AMPERSAND,
        SUB,EQ,NE,LT,GT,GE,LE,COMMA}

start
    :   part+
    ;

part
//  :   TAG_COMMENT                                 # TagComment
    :   TEXT_COMMENT                                    # TextComment
    |   TAG_OPEN part1 TAG_CLOSE                            # PartA
    |   TEXT                                        # TextStmt
    ;

part1
        :   IF expr TAG_CLOSE part* TAG_OPEN ELSE TAG_CLOSE part* TAG_OPEN ENDIF        # IfElseStmt
        |   LOOP expr TAG_CLOSE part* TAG_OPEN ENDLOOP                  # LoopStmt
        |   stmt                                        # ScriptOpen
    ;

stmt
    :   ID (DOT ID)* ASSIGN expr                            # Assign
    |   ID (DOT ID)* LBKT params RBKT                           # Proc
    |   expr                                        # Expression
    ;

params
    :   expr (COMMA expr)*
    ;

expr
    :   expr MUL expr               # Mul
    |   expr DIV expr               # Div
    |   expr ADD expr               # Add
    |   expr AMPERSAND expr         # Ampersand
    |   expr SUB expr               # Sub
    |   expr EQ expr                # Eq
    |   expr NE expr                # Ne
    |   expr LT expr                # Lt
    |   expr GT expr                # Gt
    |   expr GE expr                # Ge
    |   expr LE expr                # Le
    |   ID (DOT ID)*                # Id
    |   ID (DOT ID)* LBKT params RBKT       # Func
    |   INT                 # Int
    |   FLOAT                   # Float
    |   STRING                  # String
    |   LBKT expr RBKT              # Expr1
    ;

My sample input is:

<%if GlobalValue("operationtype") = "Add"%>
<%else%>
<%endif%>

I get the parse tree I expect from IntelliJ, but I get the following from the command line:

C:\Antlr\complex>set GRAMMAR=kscript
C:\Antlr\complex>set JAVAROOT=C:\Program Files\Java\jdk-11.0.1\bin
C:\Antlr\complex>"C:\Program Files\Java\jdk-11.0.1\bin\java.exe" -jar c:\batch\antlr-4.7.2-complete.jar -o tmp -lib tmp kscriptLexer.g4
C:\Antlr\complex>"C:\Program Files\Java\jdk-11.0.1\bin\java.exe" -jar c:\batch\antlr-4.7.2-complete.jar -o tmp -lib tmp kscriptParser.g4
C:\Antlr\complex>"C:\Program Files\Java\jdk-11.0.1\bin\javac" -cp .\;c:\batch\antlr-4.7.2-complete.jar tmp\kscript*.java
C:\Antlr\complex>cd tmp
C:\Antlr\complex\tmp>"C:\Program Files\Java\jdk-11.0.1\bin\java.exe" -cp .\;c:\batch\antlr-4.7.2-complete.jar org.antlr.v4.gui.TestRig kscript start c:\x\sample.kscript -tree
line 1:5 mismatched input 'GlobalValue' expecting {ID, LBKT, INT, FLOAT, STRING}
(start (part <% (part1 if (expr GlobalValue ( "operationtype" ) = "Add") %> <% else %> <% endif) %>))

Yet when I use the -tokens option, I get the following token stream:

[@0,0:1='<%',<'<%'>,1:0]
[@1,2:3='if',<IF>,1:2]
[@2,4:4=' ',<WS>,channel=1,1:4]
[@3,5:15='GlobalValue',<ID>,1:5]
[@4,16:16='(',<'('>,1:16]
[@5,17:31='"operationtype"',<STRING>,1:17]
[@6,32:32=')',<')'>,1:32]
[@7,33:33=' ',<WS>,channel=1,1:33]
[@8,34:34='=',<'='>,1:34]

Here 'GlobalValue' appears to be recognised as an ID, yet the parser fails to match it as the start of the expression in the IF rule.

sepp2k · Accepted Answer · 2019-04-13T12:21:32

If you look at your tmp directory, you will see two .tokens files: one for the lexer and one for the parser. If you look inside them, you'll see that they assign different numbers to the tokens. Most relevantly to the issue at hand, the lexer file contains ID=29 and the parser one contains ID=11 and LE=29.

So when the lexer sees an identifier, it correctly recognizes it as such and produces a token with the token type 29. The parser then sees that token and treats it as an LE token, because in the parser's numbering that is what token type 29 means.
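The mismatch described above can also be checked mechanically rather than by reading the two files side by side. The following is a minimal sketch in Python, not part of ANTLR itself: the helper names parse_tokens and find_mismatches are hypothetical, and the sample strings are short excerpts built only from the numbers quoted in this answer (a real .tokens file lists every token, one NAME=number entry per line).

```python
def parse_tokens(text):
    """Parse the contents of an ANTLR .tokens file into {name: type_number}."""
    mapping = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split on the LAST '=' because literal names such as '=' or '<='
        # contain an '=' character themselves.
        name, _, number = line.rpartition("=")
        mapping[name] = int(number)
    return mapping


def find_mismatches(lexer_vocab, parser_vocab):
    """Return {name: (lexer_type, parser_type)} for names whose numbers differ."""
    return {
        name: (lexer_vocab[name], parser_vocab[name])
        for name in sorted(lexer_vocab.keys() & parser_vocab.keys())
        if lexer_vocab[name] != parser_vocab[name]
    }


# Illustrative excerpts only (assumption): just the entries quoted in the answer.
lexer_tokens = "ID=29\n"
parser_tokens = "ID=11\nLE=29\n"

mismatches = find_mismatches(parse_tokens(lexer_tokens), parse_tokens(parser_tokens))
print(mismatches)  # {'ID': (29, 11)}
```

To run it against real output, read the two files from the tmp directory (assumed here to be named kscriptLexer.tokens and kscriptParser.tokens) and pass their contents to parse_tokens. Any non-empty result means the lexer and parser disagree about what a token type number means.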

To avoid this kind of issue, the lexer and parser should be using the same token definitions, not independent ones. You can achieve this by removing the tokens {...} block from the parser and instead using the tokenVocab option like this:

options {
    tokenVocab=kscriptLexer;
}
