unable to parse simple grammar using antlr4

Question

I am doing some hands-on work with ANTLR4 to parse a file, and I have been stuck on an issue that has kept me awake for a few hours now. The following is the simple grammar I have defined in the ESQLGrammar.g4 file, which is placed in my project's src/main/antlr4 directory.

grammar ESQLGrammar;

esqlCode:   
    declBrokerSchema? esqlContents;

declBrokerSchema
    :   BROKER SCHEMA schemaName (PATH schemaPathList ';')?;
schemaName
    :   IDENTIFIER;
schemaPathList
    :   IDENTIFIER (',' IDENTIFIER)*;

esqlContents
    :   (declareVariable)*?;

declareVariable
    :   DECLARE variableNames esqlDataType ';';
variableNames
    :   variableName (',' variableName)*;
variableName
    :   IDENTIFIER;
esqlDataType    
    :   (BLOB|CHARACTER|BOOLEAN|NAMESPACE);

WS              :   [ \r\t\n]+    -> skip ;
IDENTIFIER      :   [a-zA-Z_][a-zA-Z0-9_.]*;

BROKER  :   'BROKER';
SCHEMA  :   'SCHEMA';
PATH    :   'PATH';
DECLARE :   'DECLARE';
BLOB        :   'BLOB';
CHARACTER   :   'CHARACTER';
BOOLEAN :   'BOOLEAN';
NAMESPACE   :   'NAMESPACE';

My input is a file containing:

BROKER SCHEMA nameOfSchema PATH pathVal1,pathVal2;
DECLARE iSharedVar CHARACTER;

However, when I change the grammar rules as below to use quoted keywords with an extra trailing space in them

declBrokerSchema
    :   'BROKER SCHEMA ' schemaName ('PATH ' schemaPathList ';')?;
// Notice the keywords in ' ' with extra space at end.
declareVariable
    :   'DECLARE ' variableNames esqlDataType ';';

then it seems to parse the lines, but it reports the error below:

line 2:19 mismatched input 'CHARACTER' expecting {'BLOB', 'CHARACTER', 'BOOLEAN', 'NAMESPACE'}

DeclBrokerSchema(schemaName=nameOfSchema, schemaPathList=pathVal1,pathVal2)
[DeclareVariable(varibleNames=iSharedVar, dataType=CHARACTER, defultValue=null, modifier=null, isConstant=false, aliasType=null, initialValueExpression=null)]

So it appears to recognise the input, but with an error. I would appreciate your expert view on these points:

I can't find any rule that could have matched 'CHARACTER' during the lexer phase, so why does it throw an error?
Also, why does it require me to write the tokens in ' ', and with an extra trailing space at that? If I remove the space, it fails to parse.

Am I missing something here? Please help!

TomServo · Accepted Answer · 2017-06-07T02:04:05

Your original grammar is fine, except for a small but significant error in your lexer rules. As a result of this error, almost all of your input is being tokenized by the lexer as an IDENTIFIER:

[@0,0:5='BROKER',<IDENTIFIER>,1:0]
[@1,7:12='SCHEMA',<IDENTIFIER>,1:7]
[@2,14:25='nameOfSchema',<IDENTIFIER>,1:14]
[@3,27:30='PATH',<IDENTIFIER>,1:27]
[@4,32:39='pathVal1',<IDENTIFIER>,1:32]
[@5,40:40=',',<','>,1:40]
[@6,41:48='pathVal2',<IDENTIFIER>,1:41]
[@7,49:49=';',<';'>,1:49]
[@8,52:58='DECLARE',<IDENTIFIER>,2:0]
[@9,60:69='iSharedVar',<IDENTIFIER>,2:8]
[@10,71:79='CHARACTER',<IDENTIFIER>,2:19]
[@11,80:80=';',<';'>,2:28]
[@12,83:82='<EOF>',<EOF>,3:0]
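To see why this happens, here is a rough Python model of how a lexer chooses between competing rules: the longest match wins, and when two rules match the same text, the one listed first wins. It is only a sketch of that tie-breaking behaviour, not ANTLR's actual implementation; the `tokenize` helper, the `SEMI` rule name and the rule table are made up for illustration.

```python
import re

def tokenize(text, rules):
    """Toy lexer: longest match wins; ties go to the earliest rule in `rules`."""
    tokens, pos = [], 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name, pattern in rules:
            m = re.compile(pattern).match(text, pos)
            # Strict '>' keeps the earlier rule when two matches have equal length.
            if m and m.end() - pos > best_len:
                best_name, best_len = name, m.end() - pos
        if best_name is None:
            raise ValueError(f"no rule matches at offset {pos}")
        if best_name != "WS":  # mimic the grammar's '-> skip'
            tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens

# Same order as the grammar in the question: IDENTIFIER before the keywords.
question_order = [
    ("WS", r"[ \r\t\n]+"),
    ("IDENTIFIER", r"[a-zA-Z_][a-zA-Z0-9_.]*"),
    ("DECLARE", r"DECLARE"),
    ("CHARACTER", r"CHARACTER"),
    ("SEMI", r";"),
]

result = tokenize("DECLARE iSharedVar CHARACTER;", question_order)
print(result)
```

In this model, DECLARE and CHARACTER both come out as IDENTIFIER, because IDENTIFIER matches the same number of characters as the keyword rule and is listed earlier.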

So let's fix that by moving the lexer rule for IDENTIFIER to the bottom, so that it no longer takes precedence over the keyword rules:

etc...
BLOB        :   'BLOB';
CHARACTER   :   'CHARACTER';
BOOLEAN :   'BOOLEAN';
NAMESPACE   :   'NAMESPACE';
IDENTIFIER      :   [a-zA-Z_][a-zA-Z0-9._]*;

Now if you run it the lexer tokenizes the way you'd expect:

[@0,0:5='BROKER',<'BROKER'>,1:0]
[@1,7:12='SCHEMA',<'SCHEMA'>,1:7]
[@2,14:25='nameOfSchema',<IDENTIFIER>,1:14]
[@3,27:30='PATH',<'PATH'>,1:27]
[@4,32:39='pathVal1',<IDENTIFIER>,1:32]
[@5,40:40=',',<','>,1:40]
[@6,41:48='pathVal2',<IDENTIFIER>,1:41]
[@7,49:49=';',<';'>,1:49]
[@8,52:58='DECLARE',<'DECLARE'>,2:0]
[@9,60:69='iSharedVar',<IDENTIFIER>,2:8]
[@10,71:79='CHARACTER',<'CHARACTER'>,2:19]
[@11,80:80=';',<';'>,2:28]
[@12,83:82='<EOF>',<EOF>,3:0]

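Now it works. The order of lexer rules matters: the lexer takes the longest match, and when two rules match the same text, the rule defined first wins. With IDENTIFIER at the top, every keyword tied with it and lost. ;)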
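The selection rule behind this can be modelled in a few lines of Python. This is a simplified sketch under the assumption "longest match first, then earliest rule", not ANTLR's real lexer; `next_token` and the rule lists are hypothetical names used only for illustration. It also suggests why the quoted keywords with a trailing space seemed to work in the question: 'DECLARE ' is eight characters long, so it beats IDENTIFIER's seven-character match on length alone, while CHARACTER still ties with IDENTIFIER and loses.

```python
import re

def next_token(text, pos, rules):
    """Return (rule_name, length) for the winning rule at `pos` in this toy model."""
    candidates = []
    for index, (name, pattern) in enumerate(rules):
        m = re.compile(pattern).match(text, pos)
        if m:
            # Sort key: longer match first, then earlier rule (smaller index).
            candidates.append((m.end() - pos, -index, name))
    length, _, name = max(candidates)
    return name, length

IDENT = r"[a-zA-Z_][a-zA-Z0-9_.]*"

# Keywords first, IDENTIFIER last, as in the corrected grammar.
fixed_order = [("DECLARE", r"DECLARE"), ("CHARACTER", r"CHARACTER"), ("IDENTIFIER", IDENT)]

# Equal-length tie: the keyword rule is listed earlier, so it wins.
keyword_wins = next_token("CHARACTER;", 0, fixed_order)

# A longer identifier still beats the keyword, regardless of rule order.
longer_wins = next_token("CHARACTERS;", 0, fixed_order)

# Literal with a trailing space: one character longer than the IDENTIFIER match.
spaced = [("IDENTIFIER", IDENT), ("DECLARE_SP", r"DECLARE ")]
literal_wins = next_token("DECLARE iSharedVar", 0, spaced)

print(keyword_wins, longer_wins, literal_wins)
```

Putting the keywords first therefore does not stop names such as CHARACTERS from being lexed as identifiers; only exact keyword matches are affected.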
