Antlr4 Not matching composite tokens

Question

I'm trying to use Antlr4 to process the following from a file:

process example(test){
    run $test say hi
}

My grammar looks like the following:

grammar example;
main: process* EOF;

processCall: processName '(' processArg ')';

process: ('process' | 'Process' | 'PROCESS') processName '(' processArg ') {' IDENTIFIER?
        processArgReplaces IDENTIFIER? '}';
processArgReplaces: IDENTIFIER? '$' processArg IDENTIFIER?;
processName: WORD;
processArg: (WORD ',')* WORD;

WORD: [a-zA-Z0-9?_]+;

IDENTIFIER: [a-zA-Z] [ a-zA-Z0-9?_]+;
BS: [\r\n\t\f]+ -> skip;

But my output gives me no viable alternative at input 'process example name('

The problem is I need to support spaces in certain areas.

process name(arg){
    [anything here is one token]
    OR
    anotherprocess(arg) [comes out as {anotherprocess} and {arg}]
}

I've tried rearranging the IDENTIFIER rule, since I think it's taking over the match before process has a chance to. But shouldn't the explicit 'process' token keep that line from being matched as just generic words?

Mike Lischke · Accepted Answer · 2019-12-13T08:30:37

In cases like this it is always extremely helpful to print the list of tokens the lexer recognized. In your case you will get:

[@0,0:14='process example',<11>,1:0]
[@1,15:15='(',<1>,1:15]
[@2,16:19='test',<10>,1:16]
[@3,20:20=')',<2>,1:20]
[@4,27:30='run ',<11>,2:4]
[@5,31:31='$',<8>,2:8]
[@6,32:42='test say hi',<11>,2:9]
[@7,44:44='}',<7>,3:0]
[@8,46:45='<EOF>',<-1>,4:0]
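
The dump can be approximated without running ANTLR. The following Python sketch is not ANTLR itself: it hand-translates the lexer rules from the question into regular expressions (the names RULES, SOURCE and tokenize are made up for illustration) and applies the two criteria an ANTLR lexer uses to pick a token: the longest match wins and, on a tie, the rule defined first wins. Characters that no rule can start on are simply dropped here, where ANTLR would report a token recognition error.

```python
import re

# Hand translation of the question's lexer rules, in definition order.
# Literals used in parser rules become implicit tokens that ANTLR defines
# before the named lexer rules, so they are listed first.
RULES = [
    ("'('", r"\("),
    ("')'", r"\)"),
    ("'process'", r"process"),
    ("'Process'", r"Process"),
    ("'PROCESS'", r"PROCESS"),
    ("') {'", r"\) \{"),
    ("'}'", r"\}"),
    ("'$'", r"\$"),
    ("','", r","),
    ("WORD", r"[a-zA-Z0-9?_]+"),
    ("IDENTIFIER", r"[a-zA-Z][ a-zA-Z0-9?_]+"),  # note the space in the class
    ("BS", r"[\r\n\t\f]+"),                      # skipped, contains no space
]

SOURCE = "process example(test){\n    run $test say hi\n}\n"


def tokenize(text, rules, skip=("BS",)):
    """Approximate an ANTLR lexer: longest match, ties go to the first rule."""
    compiled = [(name, re.compile(pattern)) for name, pattern in rules]
    tokens, pos = [], 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name, rx in compiled:
            m = rx.match(text, pos)
            # Strict '>' keeps the earlier rule when two matches tie.
            if m and len(m.group()) > best_len:
                best_name, best_len = name, len(m.group())
        if best_name is None:
            pos += 1  # no rule matches here; drop the character
            continue
        if best_name not in skip:
            tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens


tokens = tokenize(SOURCE, RULES)
for token in tokens:
    print(token)
```

Under these assumptions the first token comes out as IDENTIFIER 'process example', matching entry @0 in the dump, while 'test' stays a WORD because WORD and IDENTIFIER tie at four characters and WORD is defined first.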

As you can see, the input process example is recognized as a single token, whereas you expected process to be recognized as a keyword. The reason for this misbehavior is the space in the IDENTIFIER rule: the lexer always prefers the longest match, so IDENTIFIER swallows the keyword together with the word that follows it. Allowing spaces inside tokens is going to create a lot of problems. In our writing system the space character is a separator between words. You cannot use it as a separator in some places and treat it as part of a larger token in others. Instead, I recommend changing the grammar as follows (this also converts all implicit tokens to explicit tokens, which avoids other trouble):

grammar Example;

start: process* EOF;

processCall: processName OPEN_PAR processArg CLOSE_PAR;

process:
    PROCESS processName OPEN_PAR processArg CLOSE_PAR OPEN_CURLY IDENTIFIER? processArgReplaces IDENTIFIER? CLOSE_CURLY
;
processArgReplaces: IDENTIFIER? DOLLAR processArg IDENTIFIER?;
processName:        IDENTIFIER;
processArg:         (IDENTIFIER COMMA)* IDENTIFIER;

OPEN_PAR:    '(';
CLOSE_PAR:   ')';
OPEN_CURLY:  '{';
CLOSE_CURLY: '}';
COMMA:       ',';
DOLLAR:      '$';

PROCESS: [pP] [rR] [oO] [cC] [eE] [sS] [sS];

IDENTIFIER: [a-zA-Z] [a-zA-Z0-9?_]+;
WS:         [ \r\n\t\f]+ -> skip;

This grammar gives you a nice parse tree for your input.

In your description you mention a part as [anything here is one token]. You probably want to skip all of that, as you are not interested in it. However, I recommend that you still parse that part (and just leave the result alone). Treating it as a single token is what requires that double role for whitespace, and you may need the parsed content later anyway.
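
Returning to the revised lexer rules above: their order matters, because process also fits IDENTIFIER and ANTLR resolves equal-length matches in favour of the rule defined first. The Python sketch below illustrates this with hand-translated regular expressions. It is an approximation, not ANTLR, and the names REVISED, REORDERED and first_token are hypothetical.

```python
import re

# Hand-translated lexer rules from the revised grammar, in definition order.
REVISED = [
    ("OPEN_PAR", r"\("), ("CLOSE_PAR", r"\)"),
    ("OPEN_CURLY", r"\{"), ("CLOSE_CURLY", r"\}"),
    ("COMMA", r","), ("DOLLAR", r"\$"),
    ("PROCESS", r"[pP][rR][oO][cC][eE][sS][sS]"),
    ("IDENTIFIER", r"[a-zA-Z][a-zA-Z0-9?_]+"),
    ("WS", r"[ \r\n\t\f]+"),  # skipped by the real lexer
]

# The same rules with PROCESS moved to the end, after IDENTIFIER.
REORDERED = [r for r in REVISED if r[0] != "PROCESS"] + [REVISED[6]]


def first_token(text, rules):
    """Return (rule name, matched text) for the token at the start of text."""
    best = None
    for name, pattern in rules:
        m = re.match(pattern, text)
        # Only a strictly longer match replaces the current best,
        # so on a tie the rule listed first is kept.
        if m and (best is None or len(m.group()) > len(best[1])):
            best = (name, m.group())
    return best


keyword_first = first_token("process example(test){", REVISED)
identifier_first = first_token("process example(test){", REORDERED)
longer_word = first_token("processor(test){", REVISED)

print(keyword_first)     # PROCESS wins the tie because it is defined first
print(identifier_first)  # with the order swapped, IDENTIFIER wins instead
print(longer_word)       # a longer identifier still beats the keyword
```

So the keyword rule has to stay above IDENTIFIER, and a name such as processor is still lexed as one IDENTIFIER because the longer match takes priority over rule order.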
