ANTLR parse key value list

Question

I'm new to ANTLR and I am trying to parse something like

ref:something title:(something else) blah ref other

and to obtain a list like

KEY = ref VALUE = something
KEY = title VALUE = something else
KEY = null VALUE = blah
KEY = null VALUE = ref // same ref string as item 1 key
KEY = null VALUE = other

The grammar I have is

searchCriteriaList 
    locals[List<object> s = new List<object>()]
           : t+=criteriaBean (WS t+=criteriaBean)* { $s.addAll($t); }
           ;

criteriaBean : (KEY ':' WS* expression)
             | expression ;

expression  : '(' WORD (WS WORD)* ')'
            | WORD ;

/*
 * Lexer Rules
 */

fragment A  : ('A'|'a') ;
fragment B  : ('B'|'b') ;
fragment C  : ('C'|'c') ;
fragment D  : ('D'|'d') ;
fragment E  : ('E'|'e') ;
fragment F  : ('F'|'f') ;
fragment G  : ('G'|'g') ;
fragment H  : ('H'|'h') ;
fragment I  : ('I'|'i') ;
fragment J  : ('J'|'j') ;
fragment K  : ('K'|'k') ;
fragment L  : ('L'|'l') ;
fragment M  : ('M'|'m') ;
fragment N  : ('N'|'n') ;
fragment O  : ('O'|'o') ;
fragment P  : ('P'|'p') ;
fragment Q  : ('Q'|'q') ;
fragment R  : ('R'|'r') ;
fragment S  : ('S'|'s') ;
fragment T  : ('T'|'t') ;
fragment U  : ('U'|'u') ;
fragment V  : ('V'|'v') ;
fragment W  : ('W'|'w') ;
fragment X  : ('X'|'x') ;
fragment Y  : ('Y'|'y') ;
fragment Z  : ('Z'|'z') ;


fragment LOWERCASE  : [a-z] ;
fragment UPPERCASE  : [A-Z] ;

TITLE   : T I T L E ;
MESSAGE : M E S S A G E ;
REF     : R E F ;

KEY     : TITLE | MESSAGE | REF ;
WORD    : (LOWERCASE | UPPERCASE | '_')+ ;
WS      : [ \t\u000C\r\n] ;

When I try parsing the string I get 2 exceptions, and in the addAll method I end up with 3 elements rather than 5. Can someone point me in the right direction? What am I doing wrong?

Thanks, S

PS: The exception I am getting is:

Exception of type 'Antlr4.Runtime.InputMismatchException' was thrown.
InputStream: {ref:something title:(something else) blah ref other }
OffendingToken: {[@0,0:2='ref',<5>,1:0]}

Can you post the exceptions? Also, try stepping through your program with a debugger. — Impulse The Fox
Updated the post with the exception details (see the PS above). — sanatakos

Bart Kiers · Accepted Answer · 2018-03-20T17:39:32

The lexer tries to match as many characters as possible when constructing tokens. When 2 or more lexer rules match the same characters, the rule defined first "wins". With this in mind, the KEY token will never be created since TITLE, MESSAGE and REF are defined above it:

TITLE   : T I T L E ;
MESSAGE : M E S S A G E ;
REF     : R E F ;

KEY     : TITLE | MESSAGE | REF ;
WORD    : (LOWERCASE | UPPERCASE | '_')+ ;

So the input ref will always become a REF token, never a KEY or a WORD. What you need to do is turn KEY into a parser rule instead of a lexer rule.
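
To make the two lexer rules above concrete, here is a rough Python sketch (not ANTLR itself) that mimics "longest match first, then the rule defined first". The RULES table and the next_token helper are made up for illustration; they only mirror the order of the lexer rules in the question.

```python
import re

# Lexer rules in definition order, mirroring the grammar in the question.
# This is an illustrative model only, not how ANTLR is implemented.
RULES = [
    ("TITLE", r"title"),
    ("MESSAGE", r"message"),
    ("REF", r"ref"),
    ("KEY", r"title|message|ref"),
    ("WORD", r"[A-Za-z_]+"),
]

def next_token(text, pos=0):
    """Return (rule_name, matched_text) for the token starting at pos."""
    best = None
    for name, pattern in RULES:
        m = re.compile(pattern, re.IGNORECASE).match(text, pos)
        # Strictly longer matches replace the current best; on a tie the
        # earlier rule is kept, which models "the rule defined first wins".
        if m and (best is None or len(m.group()) > len(best[1])):
            best = (name, m.group())
    return best

print(next_token("ref"))        # ('REF', 'ref')
print(next_token("reference"))  # ('WORD', 'reference')
```

KEY can only ever tie with TITLE, MESSAGE or REF, and it is defined after all three, so in this model it never wins, which is the same reason the KEY token never shows up in the real lexer.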

Also, since you want a WORD to also match your keywords, you should not do this:

expression
 : '(' WORD (WS WORD)* ')'
 | WORD 
 ;

but something like this instead:

expression
 : '(' word (WS word)* ')'
 | word 
 ;

word
 : key
 | WORD
 ;

key
 : TITLE
 | MESSAGE
 | REF
 ;

Oh, and this:

fragment Z  : ('Z'|'z') ;

can be rewritten as:

fragment Z  : [Zz] ;

And is there a particular reason you're littering your parser rules with WS tokens? You could just remove them during tokenisation:

WS      : [ \t\u000C\r\n] -> skip;
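
For reference, this is roughly the key/value pairing the suggestions above are aiming for, written as a small hand-rolled Python sketch rather than generated ANTLR code. It assumes whitespace is dropped during tokenising, that keywords are also accepted as plain words, and that a keyword only counts as a key when a ':' follows it. All names here (tokenize, parse_expression, parse_criteria) are hypothetical.

```python
import re

KEYWORDS = {"title", "message", "ref"}
# Whitespace, a word, or one of the punctuation tokens ':' '(' ')'.
TOKEN_RE = re.compile(r"\s+|[A-Za-z_]+|[:()]")

def tokenize(text):
    """Split text into tokens, skipping whitespace (like '-> skip')."""
    tokens, pos = [], 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character {text[pos]!r} at {pos}")
        if not m.group().isspace():
            tokens.append(m.group())
        pos = m.end()
    return tokens

def is_word(tok):
    # Keywords count as words too, like the 'word' parser rule.
    return tok not in (":", "(", ")")

def parse_expression(tokens, i):
    """expression : '(' word+ ')' | word ; returns (value, next_index)."""
    if tokens[i] == "(":
        i += 1
        words = []
        while i < len(tokens) and is_word(tokens[i]):
            words.append(tokens[i])
            i += 1
        if not words or i >= len(tokens) or tokens[i] != ")":
            raise ValueError("expected one or more words followed by ')'")
        return " ".join(words), i + 1
    if not is_word(tokens[i]):
        raise ValueError(f"unexpected token {tokens[i]!r}")
    return tokens[i], i + 1

def parse_criteria(text):
    """Return a list of (key, value) pairs; key is None when absent."""
    tokens = tokenize(text)
    result, i = [], 0
    while i < len(tokens):
        key = None
        # A keyword is only a key when it is directly followed by ':'.
        if (tokens[i].lower() in KEYWORDS
                and i + 1 < len(tokens) and tokens[i + 1] == ":"):
            key = tokens[i]
            i += 2
            if i >= len(tokens):
                raise ValueError("expected a value after ':'")
        value, i = parse_expression(tokens, i)
        result.append((key, value))
    return result

for key, value in parse_criteria(
        "ref:something title:(something else) blah ref other"):
    print("KEY =", key, "VALUE =", value)
```

With the sample input from the question this yields five pairs, the second ref coming out as a plain value with no key. Note that once WS is skipped, the WS references in the parser rules have to go as well.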
