antlr4: token is not recognised as intended

Question

I am trying to build a grammar with antlr4 that can store intermediate parsing results in variables for later use. I thought about using a keyword, like as (or the German als), to trigger this storing functionality. Besides this, I have a general-purpose token ID that matches any identifier. Storing should be optional for the user, so I use the ? operator in my grammar definition.

My grammar looks as follows:

grammar TokenTest;

@header {
package some.package.declaration;
}

AS : 'als' ;
VALUE_ASSIGNMENT : AS ID ;

ID : [a-zA-Z_][a-zA-Z0-9_]+ ;

WS : [ \t\n\r]+ -> skip ;

ANY : . ;

formula  :  identifier=ID (variable=VALUE_ASSIGNMENT)?  #ExpressionIdentifier
;

The grammar compiles without errors. But when I run the following TestNG tests, I cannot explain their behaviour:

package some.package.declaration;

import java.util.List;

import org.antlr.v4.runtime.CharStreams;
import org.antlr.v4.runtime.CommonTokenStream;
import org.antlr.v4.runtime.Token;
import org.testng.Assert;
import org.testng.annotations.DataProvider;
import org.testng.annotations.Test;

import some.package.declaration.TokenTestLexer;

public class TokenTest {

    private static List<Token> getTokens(final String input) {
        final TokenTestLexer lexer = new TokenTestLexer(CharStreams.fromString(input));
        final CommonTokenStream tokens = new CommonTokenStream(lexer);
        tokens.fill();
        return tokens.getTokens();
    }

    @DataProvider (name = "tokenData")
    public Object[][] tokenData() {
        return new Object [][] {
            {"result", new String[] {"result"}, new int[] {TokenTestLexer.ID}},
            {"als", new String[] {"als"}, new int[] {TokenTestLexer.AS}},
            {"result als x", new String[] {"result", "als", "x"}, new int[] {TokenTestLexer.ID, TokenTestLexer.AS, TokenTestLexer.ID}},
        };
    }

    @Test (dataProvider = "tokenData")
    public void testTokenGeneration(final String input, final String[] expectedTokens, final int[] expectedTypes) {
//      System.out.println("test token generation for <" + input + ">");
        Assert.assertEquals(expectedTokens.length, expectedTypes.length);
        final List<Token> parsedTokens = getTokens(input);
        Assert.assertEquals(parsedTokens.size()-1/*EOF is a token*/, expectedTokens.length);
        for (int index = 0; index < expectedTokens.length; index++) {
            final Token currentToken = parsedTokens.get(index);
            Assert.assertEquals(currentToken.getText(), expectedTokens[index]);
            Assert.assertEquals(currentToken.getType(), expectedTypes[index]);
        }
    }

}

The second test tells me that the word als is parsed as an AS-token. But, the third test does not work as intended. I assume it to be an ID-token, followed by an AS-token, and finally followed by an ID-token. But instead, the last token will be recognized as an ANY-token.

If I change the definition of the AS-token as follows:

fragment AS : 'als' ;

there is another strange behaviour. Of course, the second test case no longer works, since there is no AS-token any more. That's no surprise. Instead, the x in the third test case will be recognized as an ANY-token. But, I assume the whole "als x"-sequence to be a VALUE_ASSIGNMENT-token. What am I doing wrong? Any help would be really nice.

Kind regards!

Good first question: clear problem description, with observed and expected behavior, and a small grammar to reproduce it. Wish all questions were half as good as yours :thumbsup: - Bart Kiers

Bart Kiers · Accepted Answer · 2020-10-06T14:45:37

But, the third test does not work as intended. I assume it to be an ID-token, followed by an AS-token, and finally followed by an ID-token. But instead, the last token will be recognized as an ANY-token

That is because you defined:

ID : [a-zA-Z_][a-zA-Z0-9_]+ ;

where the + means "one or more", so ID needs at least two characters and the single-character x falls through to ANY. What you probably want is "zero or more":

ID : [a-zA-Z_][a-zA-Z0-9_]* ;
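The difference can be checked outside ANTLR with plain java.util.regex, since these two character-class patterns mean the same thing in Java regex syntax. This is only a sketch to illustrate the quantifiers; the class and method names (IdPatternDemo, matchesPlus, matchesStar) are made up for the example and are not part of the generated lexer.

```java
import java.util.regex.Pattern;

public class IdPatternDemo {
    // Regex equivalents of the two ID rule bodies (assumption: same meaning as in the grammar).
    static final Pattern ID_PLUS = Pattern.compile("[a-zA-Z_][a-zA-Z0-9_]+");
    static final Pattern ID_STAR = Pattern.compile("[a-zA-Z_][a-zA-Z0-9_]*");

    // True if the whole text matches the original rule body (first char, then one or more).
    static boolean matchesPlus(String text) {
        return ID_PLUS.matcher(text).matches();
    }

    // True if the whole text matches the corrected rule body (first char, then zero or more).
    static boolean matchesStar(String text) {
        return ID_STAR.matcher(text).matches();
    }

    public static void main(String[] args) {
        System.out.println(matchesPlus("x"));      // false: "+" demands a second character
        System.out.println(matchesPlus("result")); // true
        System.out.println(matchesStar("x"));      // true: "*" allows a single character
    }
}
```

With the + version, the only rule left that can match a lone x is ANY, which is what the third test reports.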

But, I assume the whole "als x"-sequence to be a VALUE_ASSIGNMENT-token. What am I doing wrong?

Note that WS -> skip only discards whitespace between tokens, so the parser never sees it; inside a lexer rule, nothing is skipped. This means that VALUE_ASSIGNMENT will only match alsFOO, and not als FOO. This rule should probably be a parser rule instead:

value_assignment : AS ID ;
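To see why the lexer-rule version cannot work, here is a rough emulation of how the lexer picks a token: at each position every rule is tried, the longest match wins, and on a tie the rule defined first wins. It is a sketch written with java.util.regex, not the ANTLR runtime; the class name LongestMatchSketch and the "TYPE:text" output format are invented for the illustration, and the regexes are my approximation of the question's lexer rules with the corrected ID.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LongestMatchSketch {
    // Lexer rules in grammar order (assumption: regex approximations of the question's rules).
    static Map<String, Pattern> rules() {
        Map<String, Pattern> rules = new LinkedHashMap<>();
        rules.put("AS", Pattern.compile("als"));
        // VALUE_ASSIGNMENT : AS ID ; written out, with no room for whitespace in between.
        rules.put("VALUE_ASSIGNMENT", Pattern.compile("als[a-zA-Z_][a-zA-Z0-9_]*"));
        rules.put("ID", Pattern.compile("[a-zA-Z_][a-zA-Z0-9_]*"));
        rules.put("WS", Pattern.compile("[ \\t\\n\\r]+"));
        rules.put("ANY", Pattern.compile(".", Pattern.DOTALL));
        return rules;
    }

    // Returns tokens as "TYPE:text"; WS matches are dropped, mimicking "-> skip".
    static List<String> tokenize(String input) {
        Map<String, Pattern> rules = rules();
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            String bestName = null;
            int bestEnd = pos;
            for (Map.Entry<String, Pattern> rule : rules.entrySet()) {
                Matcher m = rule.getValue().matcher(input).region(pos, input.length());
                // Only a strictly longer match replaces the current best,
                // so on a tie the earlier rule is kept.
                if (m.lookingAt() && m.end() > bestEnd) {
                    bestName = rule.getKey();
                    bestEnd = m.end();
                }
            }
            // ANY matches any single character, so bestName is never null here.
            if (!"WS".equals(bestName)) {
                tokens.add(bestName + ":" + input.substring(pos, bestEnd));
            }
            pos = bestEnd;
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("result als x")); // [ID:result, AS:als, ID:x]
        System.out.println(tokenize("alsFOO"));       // [VALUE_ASSIGNMENT:alsFOO]
    }
}
```

In this emulation "als x" never yields a VALUE_ASSIGNMENT token, because the space ends the match after als. The parser rule value_assignment works on the token stream instead, where the whitespace has already been dropped.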
