Why does this grammar fail to parse this input?

Question

I'm defining a grammar for a small language and Antlr4. The idea is in that language, there's a keyword "function" which can be used to either define a function or as a type specifier when defining parameters. I would like to be able to do something like this:

function aFunctionHere(int a, function callback) ....

However, it seems Antlr doesn't like that I use "function" in two different places. As far as I can tell, the grammar isn't even ambiguous.

In the following grammar, if I remove LINE 1, the generated parser parses the sample input without a problem. Also, if I change the token string in either LINE 2 or LINE 3, so that they are not equal, the parser works.

The error I get with the grammar as-is:

line 1:0 mismatched input 'function' expecting <INVALID>

What does "expecting <INVALID>" mean?

The (stripped down) grammar:

grammar test;

begin : function ;

function: FUNCTION IDENTIFIER '(' parameterlist? ')' ;

parameterlist: parameter (',' parameter)+ ;

parameter: BaseParamType IDENTIFIER ;

// Lexer stuff

BaseParamType:
      INT_TYPE
    | FUNCTION_TYPE // <---- LINE 1
    ;

FUNCTION : 'function'; // <---- LINE 2

INT_TYPE : 'int';
FUNCTION_TYPE : 'function'; // <---- LINE 3

IDENTIFIER : [a-zA-Z_$]+[a-zA-Z_$0-9]*;

WS : [ \t\r\n]+ -> skip ;

The input I'm using:

function abc(int c, int d, int a)

The program to test the generated parser:

from antlr4 import *
from testLexer import testLexer as Lexer
from testParser import testParser as Parser
from antlr4.tree.Trees import Trees

def main(argv):
    input = FileStream(argv[1] if len(argv)>1 else "test.in")
    lexer =  Lexer(input)
    tokens = CommonTokenStream(lexer)
    parser = Parser(tokens)

    tree = parser.begin()

    print Trees.toStringTree(tree, None, parser)


if __name__ == '__main__':
    import sys
    main(sys.argv)

Does it work with the stripped down grammar? It's always a good idea to build up the grammar piece-wise, a little bit at a time, then it's very easy to know when and where you introduced an error. And yes, it means you have to simplify the input as well, and build it out as you add to the grammar. — Some programmer dude
If I remove the line I marked as "LINE 1" the grammar parses the sample input I posted. But of course, then it wouldn't parse function definitions with parameters of type function, which is what I need. — Gato
By the way, and on an unrelated note, you don't need the + in the IDENTIFIER regex, doing just [a-zA-Z_$][a-zA-Z_$0-9]* should be enough. — Some programmer dude

rici rici · Accepted Answer · 2016-01-17T15:53:41

Just use one name for the token function.

A token is just a token. Looking at function in isolation, it is not possible to decide whether it is a FUNCTION or a FUNCTION_TYPE. Since FUNCTION, comes first in the file, that's what the lexer used. That makes it impossible to match FUNCTION_TYPE, so that becomes an invalid token type.

The parser will figure out the syntactic role of the token function. So there would be no point using two different lexical descriptors for the same token, even if it would be possible.

In the grammar in the OP, BaseParamType is also a lexical type, which will absorb all uses of the token function, preventing FUNCTION from being recognized in the production for function. Changing its name to baseParamType, which effectively changes it to a parser non-terminal, will allow the parser to work, although I suppose it may alter the parse tree in undesirable ways.

I understand the objection that the parser "should know" which lexical tokens are possible in context, given the nature of Antlr's predictive parsing strategy. I'm far from an Antlr expert so I won't pretend to explain why it doesn't seem to work, but with the majority of parser generators -- and all the ones I commonly use -- lexical analysis is effectively performed as a prior pass to parsing, so the conversion of textual input into a stream of tokens is done prior to the parser establishing context. (Most lexical generators, including Antlr, have mechanisms with which the user can build lexical context, but IMHO these mechanisms reduce grammar readability and should only be used if strictly necessary.)

Here's the grammar file which I tested:

grammar test;  
begin : function ;
function: FUNCTION IDENTIFIER '(' parameterlist? ')' ;
parameterlist: parameter (',' parameter)+ ;
parameter: baseParamType IDENTIFIER ;

// Lexer stuff
baseParamType:
      INT_TYPE
    | FUNCTION //
    ;

FUNCTION : 'function';
INT_TYPE : 'int';
IDENTIFIER : [a-zA-Z_$]+[a-zA-Z_$0-9]*;
WS : [ \t\r\n]+ -> skip ;

Why does this grammar fail to parse this input?

1 Answers