The lexer grammar below contains two sets of rules: (1) rules for tokenizing CSV-formatted input, and (2) rules for tokenizing key/value-formatted input. For (1) I put the tokens on channel(0). For (2) I put the tokens on channel(1). Do you see any problems with my lexer grammar?

Also below is a parser grammar, which likewise contains two sets of rules: (1) rules for structuring CSV tokens into a parse tree, and (2) rules for structuring key/value tokens into a parse tree. Do you see any problems with my parser grammar?

When I apply ANTLR to the grammar files, compile, and then run the test rig (with the -gui flag) using this CSV input:

FirstName, LastName, Street, City, State, ZipCode
Mark,, 4460 Stuart Street, Marion Center, PA, 15759

the parse tree is completely wrong - the tree contains no data. I have no idea why the parse tree is wrong. Any suggestions? I have tested each part separately (removed the key/value rules from the lexer and parser grammars and ran it with CSV input, removed the CSV rules from the lexer and parser grammars and ran it with key/value input) and it works fine.

Lexer Grammar

lexer grammar MyLexer;      

COMMA  : ','            -> channel(0) ;
NL     : ('\r')?'\n'    -> channel(0) ;
WS     : [ \t\r\n]+     -> skip, channel(0) ;
STRING : (~[,\r\n])+     -> channel(0) ;            

KEY       : ('FirstName' | 'LastName')  -> channel(1) ;
EQ        : '='                         -> channel(1) ;
NL2       : ('\r')?'\n'                 -> channel(1) ;
WS2       : [ \t\r\n]+                  -> skip, channel(1) ;
VALUE     : (~[=\r\n])+                  -> channel(1) ;

Parser Grammar

parser grammar MyParser;                

options { tokenVocab=MyLexer; }         

csv       : (header rows)+ EOF ;
header    : field (COMMA field)* NL ;
rows      : (row)* ;    
row       : field (COMMA field)* NL ;
field     : STRING | ;

keyValue  : pairs EOF ;
pairs     : (pair)+ ;
pair      : key EQ value NL2;
key       : KEY ;
value     : VALUE ; 

Why are you using channels in the lexer grammar? (comment by cantSleepNow)

1 Answer

The longest token match wins, and if two matches are the same length, the rule listed first wins. That means:

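STRING matches everything that KEY and EQ match and is listed before them, so you will never get tokens of those types. VALUE only wins where it matches more text than the rules before it, which is the case on a CSV line: VALUE does not stop at commas, so it swallows the whole line and puts it on channel(1), which the parser does not read by default.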
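
The effect of these two rules on the inputs from the question can be checked with a small simulation. The sketch below is not ANTLR; it is a plain Python approximation that assumes only the two rules stated above (longest match wins, the earlier rule wins ties), with regular expressions transcribed by hand from the lexer grammar. The `tokenize` function and the `RULES` table are names made up for illustration, and the `skip` rules are only labelled, not dropped.

```python
import re

# Lexer rules transcribed by hand from the question, in grammar order:
# (rule name, regex, channel). "skip" labels the two rules that carry `skip`.
RULES = [
    ("COMMA",  r",",                  0),
    ("NL",     r"\r?\n",              0),
    ("WS",     r"[ \t\r\n]+",         "skip"),
    ("STRING", r"[^,\r\n]+",          0),
    ("KEY",    r"FirstName|LastName", 1),
    ("EQ",     r"=",                  1),
    ("NL2",    r"\r?\n",              1),
    ("WS2",    r"[ \t\r\n]+",         "skip"),
    ("VALUE",  r"[^=\r\n]+",          1),
]


def tokenize(text):
    """Return (rule, lexeme, channel) triples: longest match, first rule on ties."""
    pos, tokens = 0, []
    while pos < len(text):
        best = None
        for name, pattern, channel in RULES:
            m = re.compile(pattern).match(text, pos)
            # Only a strictly longer match replaces the current best,
            # so the rule listed first keeps winning ties.
            if m and m.end() > pos and (best is None or m.end() > best[1]):
                best = (name, m.end(), channel)
        if best is None:
            raise ValueError(f"no rule matches at offset {pos}")
        name, end, channel = best
        tokens.append((name, text[pos:end], channel))
        pos = end
    return tokens


if __name__ == "__main__":
    for line in ("Mark,, 4460 Stuart Street, Marion Center, PA, 15759\n",
                 "FirstName=Mark\n"):
        for name, lexeme, channel in tokenize(line):
            print(f"{name:<7} channel={channel} {lexeme!r}")
```

Under these assumptions the CSV line comes out as a single VALUE token on channel 1 followed by NL on channel 0, while the key/value line comes out as a single STRING token on channel 0. A parser that reads only channel 0 would then see nothing but line breaks for the CSV input, which would fit the empty tree described in the question.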

The ANTLR parser needs random access to the token stream, so the lexer cannot depend on what the parser currently expects; context-sensitive lexing is not possible this way.

I suggest putting the two sets of lexer rules into separate lexer grammars. Using them with a common parser grammar may get tricky; if so, split the parser grammar as well.