I'm trying to use ANTLR4 to parse the following input from a file:

process example(test){
    run $test say hi
}

My grammar looks like the following:

grammar example;
main: process* EOF;

processCall: processName '(' processArg ')';

process: ('process' | 'Process' | 'PROCESS') processName '(' processArg ') {' IDENTIFIER?
        processArgReplaces IDENTIFIER? '}';
processArgReplaces: IDENTIFIER? '$' processArg IDENTIFIER?;
processName: WORD;
processArg: (WORD ',')* WORD;

WORD: [a-zA-Z0-9?_]+;

IDENTIFIER: [a-zA-Z] [ a-zA-Z0-9?_]+;
BS: [\r\n\t\f]+ -> skip;

But the parser reports: no viable alternative at input 'process example name('

The problem is that I need to support spaces in certain places:

process name(arg){
    [anything here is one token]
    OR
    anotherprocess(arg) [comes out as {anotherprocess} and {arg}]
}

I've tried changing the IDENTIFIER rule around, since I think it is taking over the match before 'process' gets a chance to. But shouldn't the explicit 'process' token mean that this line is not treated as just generic words?

1 Answer

In cases like this it is always extremely helpful to print the list of tokens the lexer recognized. In your case you will get:

[@0,0:14='process example',<11>,1:0]
[@1,15:15='(',<1>,1:15]
[@2,16:19='test',<10>,1:16]
[@3,20:20=')',<2>,1:20]
[@4,27:30='run ',<11>,2:4]
[@5,31:31='$',<8>,2:8]
[@6,32:42='test say hi',<11>,2:9]
[@7,44:44='}',<7>,3:0]
[@8,46:45='<EOF>',<-1>,4:0]
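With the ANTLR tools installed, a dump like this usually comes from the TestRig's -tokens option. To see why the tokens come out this way without running ANTLR at all, here is a minimal sketch that imitates the lexer's matching policy in plain Python: at each position the longest match wins, and on a tie the rule defined first wins. The names RULES, SKIPPED, SOURCE and tokenize are made up for this illustration, and the rule order (implicit literals first, then WORD, IDENTIFIER, BS) is an assumption read off the token type numbers in the dump.

```python
import re

# Lexer rules of the question's grammar, in the assumed priority order:
# implicit literals first, then the named rules WORD, IDENTIFIER, BS.
RULES = [
    ("'('", r"\("),
    ("')'", r"\)"),
    ("'process'", r"process"),
    ("'Process'", r"Process"),
    ("'PROCESS'", r"PROCESS"),
    ("') {'", r"\) \{"),
    ("'}'", r"\}"),
    ("'$'", r"\$"),
    ("','", r","),
    ("WORD", r"[a-zA-Z0-9?_]+"),
    ("IDENTIFIER", r"[a-zA-Z][ a-zA-Z0-9?_]+"),  # note the space in the class
    ("BS", r"[\r\n\t\f]+"),
]
SKIPPED = {"BS"}  # rules marked "-> skip"

SOURCE = "process example(test){\n    run $test say hi\n}\n"


def tokenize(text):
    """Longest match wins; on equal length the earlier rule wins."""
    compiled = [(name, re.compile(pattern)) for name, pattern in RULES]
    pos, tokens = 0, []
    while pos < len(text):
        best_name, best_len = None, 0
        for name, regex in compiled:
            match = regex.match(text, pos)
            # Strict ">" keeps the earlier rule when lengths are equal.
            if match and len(match.group()) > best_len:
                best_name, best_len = name, len(match.group())
        if best_name is None:
            # No rule matches this character (for example '{' or a leading
            # space here); a real ANTLR lexer reports a token recognition
            # error and moves on, this sketch just drops the character.
            pos += 1
            continue
        if best_name not in SKIPPED:
            tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens


for name, text in tokenize(SOURCE):
    print(f"{name:12} {text!r}")
```

The first token printed is IDENTIFIER 'process example': the keyword literal only matches 7 characters, while IDENTIFIER, thanks to the space in its character class, matches 15.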

As you can see, the input "process example" is recognized as a single IDENTIFIER token, while you expected "process" to be recognized as a keyword. The reason is the space in the IDENTIFIER rule: the lexer always prefers the longest match, and with the space allowed, IDENTIFIER matches all of "process example", which is longer than the keyword 'process'. This is going to create a lot of problems. In our writing system the space character separates words; you cannot treat it as a separator in some places and as part of a larger token in others. Instead I recommend you change the grammar as follows (this also converts all implicit tokens to explicit ones, which avoids other trouble):

grammar Example;

start: process* EOF;

processCall: processName OPEN_PAR processArg CLOSE_PAR;

process:
    PROCESS processName OPEN_PAR processArg CLOSE_PAR OPEN_CURLY IDENTIFIER? processArgReplaces IDENTIFIER? CLOSE_CURLY
;
processArgReplaces: IDENTIFIER? DOLLAR processArg IDENTIFIER?;
processName:        IDENTIFIER;
processArg:         (IDENTIFIER COMMA)* IDENTIFIER;

OPEN_PAR:    '(';
CLOSE_PAR:   ')';
OPEN_CURLY:  '{';
CLOSE_CURLY: '}';
COMMA:       ',';
DOLLAR:      '$';

PROCESS: [pP] [rR] [oO] [cC] [eE] [sS] [sS];

IDENTIFIER: [a-zA-Z] [a-zA-Z0-9?_]+;
WS:         [ \r\n\t\f]+ -> skip;

With this grammar your input produces a nice parse tree.
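To check the token stream of the corrected grammar without generating a parser, the same longest-match idea can be sketched in plain Python. This is only an approximation of what an ANTLR lexer does, and RULES, SKIPPED, SOURCE and tokenize are hypothetical names for this illustration. It shows the two cases that matter: "process" matches both PROCESS and IDENTIFIER with the same length, so the rule defined first (PROCESS) wins, while a longer word such as "processing" stays an IDENTIFIER.

```python
import re

# Token rules of the corrected grammar, in definition order.
RULES = [
    ("OPEN_PAR", r"\("),
    ("CLOSE_PAR", r"\)"),
    ("OPEN_CURLY", r"\{"),
    ("CLOSE_CURLY", r"\}"),
    ("COMMA", r","),
    ("DOLLAR", r"\$"),
    ("PROCESS", r"[pP][rR][oO][cC][eE][sS][sS]"),
    ("IDENTIFIER", r"[a-zA-Z][a-zA-Z0-9?_]+"),
    ("WS", r"[ \r\n\t\f]+"),
]
SKIPPED = {"WS"}  # rules marked "-> skip"

SOURCE = "process example(test){\n    run $test say hi\n}\n"


def tokenize(text):
    """Longest match wins; on equal length the earlier rule wins."""
    compiled = [(name, re.compile(pattern)) for name, pattern in RULES]
    pos, tokens = 0, []
    while pos < len(text):
        best_name, best_len = None, 0
        for name, regex in compiled:
            match = regex.match(text, pos)
            # Strict ">" keeps the earlier rule when lengths are equal.
            if match and len(match.group()) > best_len:
                best_name, best_len = name, len(match.group())
        if best_name is None:
            raise ValueError(f"no rule matches {text[pos]!r} at offset {pos}")
        if best_name not in SKIPPED:
            tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens


for name, text in tokenize(SOURCE):
    print(f"{name:12} {text!r}")
```

One thing this makes easy to notice: because IDENTIFIER ends in `+`, it needs at least two characters, so a one-letter name would not lex. If single-letter names should be allowed, `*` in place of the final `+` would do it.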

In your description you mark one part as [anything here is one token]. You probably want to skip all of that, since you are not interested in it. However, I recommend that you still parse that part properly (and simply leave the result alone). Treating it as a single token would require implementing that double role for whitespace, and you may need the parsed content later anyway.