ANTLR 4 grammar gives extraneous input error

Question

I was attempting to create what I thought was a simple grammar for processing a file that contains a list of key/value assignments, one assignment per line.

I have used ANTLR in the past (mid-90's) and decided to pick it up again because I wanted to provide for commenting in the assignment file and for Unicode keywords and values.

My simple test files demonstrate yet again that writing proper grammars is a hard problem even with good tools. I am using the ANTLR Language Support Plug-in for VS 2012 and developing in C#. So, I am well off the Eclipse/Java reservation, but the C# plug-in and the ANTLR NuGet packages (runtime and code generator) are working exactly as advertised.

My grammar file is:

grammar AssignmentListFile;

/*
 * See: http://en.wikipedia.org/wiki/List_of_Unicode_characters
 * for list of Unicode Code Points
 */


/*
 * Lexer Rules: Must be in all UPPER case
 * Parser Rules: Must be in all lower case
 */

// Ignore All non-printable control characters except: CR, LF and SPACE
IGNORED_WHITESPACE : 
       (
         '\u0000' .. '\u0009'  // 7-bit control chars less than Line Feed
       | '\u000B'  | '\u000C'  // Vertical tab and Form feed
       | '\u000E' .. '\u001F'  // 7-bit control chars more than Carriage Return
       | '\u007F' .. '\u009F'  // 8-bit ASCII control characters and DEL
       )+
     -> channel(HIDDEN)
     ;

// Ignore Comments and any ending white spaces
JAVADOC_COMMENT  
  : '/**' .*? '*/' [ \r\n]*
  -> channel(HIDDEN)
  ;
CSTYLE_COMMENT  
  : '/*'  .*? '*/'  [ \r\n]*
  -> channel(HIDDEN)
  ;

/*
 * Manage the assignment delimiter and 
 * the 3 white space characters which have not been ignored: SPACE, CR, and LF
 */
fragment SINGLE_SPACE : ' ';
EQUALS : '=';
EOL : SINGLE_SPACE* [\r\n]+ SINGLE_SPACE* ;
ASSIGNMENT_OPERATOR :  SINGLE_SPACE* EQUALS SINGLE_SPACE* ;

// define the various forms of single and double quotes for the dumb, open, and close variants 
                     //   ASCII    Open/Left  Close/Right
CHAR_SINGLEQUOTE : ('\u0027' | '\u2018' | '\u2019') ;
CHAR_DOUBLEQUOTE : ('\u0022' | '\u201C' | '\u201D') ;

/*
 * create the character sets that can be part of an ID
 */
fragment IDCHAR_COMMON : 
         ( '\u0020'  | '\u0021'  // Space and bang (!)
         | '\u0023' .. '\u0026'  // # to & (skips ")
         | '\u0028' .. '\u003C'  // ( to < (skips ')
         | '\u003E' .. '\u007E'  // > to ~ (skips =)
         | '\u00A0' .. '\u2018'  // printable UNICODE code points below  Open Single Quote
         | '\u201A' .. '\u201B'  // printable UNICODE code points between Close Single Quote and Open Double Quote
         | '\u201E' .. '\uFFFF'  // printable UNICODE code points above Close Double Quote
         )
       ;


// define the characters that can be contained in each of the quoted identifier types
NON_QUOTED_VALUE : IDCHAR_COMMON+;
DOUBLE_QUOTED_VALUE : NON_QUOTED_VALUE 
          | (IDCHAR_COMMON |  CHAR_SINGLEQUOTE | EQUALS)+
          ;
SINGLE_QUOTED_VALUE : NON_QUOTED_VALUE 
          | (IDCHAR_COMMON |  CHAR_DOUBLEQUOTE | EQUALS)+
          ;

file : file_line* EOF ;

file_line 
  : assignment
  | EOL
  ;

assignment
  : identifier  ASSIGNMENT_OPERATOR  identifier 
  ;

identifier 
    : NON_QUOTED_VALUE 
    | CHAR_DOUBLEQUOTE DOUBLE_QUOTED_VALUE CHAR_DOUBLEQUOTE 
    | CHAR_SINGLEQUOTE SINGLE_QUOTED_VALUE CHAR_SINGLEQUOTE
    ;

My input file is:

/*
 * This is a Multiline C-Style comment
 * with white space here:   
 */
/* this is a single line C-Style comment  */
/* this is a single line C-Style comment /w whitepace */
/*      

  */
/**/

/**
 * this is a Multiline JavaDoc comment
 * with white space here:    
 */
/** this is a single line JavaDoc comment */
/**     

  */

  /***/     

JOHN=WASHBURN
 JOHN = WASHBURN 
'JOHN'='WASHBURN'
"JOHN" = "WASHBURN"

The C# code that invokes the Lexer/Parser is:

  var input = new AntlrInputStream(textStream.ReadToEnd());
  var lexer = new AssignmentListFileLexer(input);
  var tokens = new CommonTokenStream(lexer);
  var parser = new AssignmentListFileParser(tokens);

  Console.WriteLine("\n");
  IParseTree tree = parser.file();
  Console.WriteLine(tree.ToStringTree(parser));
  Console.WriteLine("\n");

And the result from NUnit when you invoke this C# against the test file is:

line 23:0 extraneous input 'JOHN=WASHBURN' expecting {<EOF>, EOL, CHAR_SINGLEQUOTE, CHAR_DOUBLEQUOTE, NON_QUOTED_VALUE}
line 24:1 extraneous input 'JOHN = WASHBURN ' expecting {<EOF>, EOL, CHAR_SINGLEQUOTE, CHAR_DOUBLEQUOTE, NON_QUOTED_VALUE}
line 25:0 extraneous input ''JOHN'='WASHBURN'' expecting {<EOF>, EOL, CHAR_SINGLEQUOTE, CHAR_DOUBLEQUOTE, NON_QUOTED_VALUE}
line 26:0 extraneous input '"JOHN" = "WASHBURN"' expecting {<EOF>, EOL, CHAR_SINGLEQUOTE, CHAR_DOUBLEQUOTE, NON_QUOTED_VALUE}

(file JOHN=WASHBURN (file_line \r\n ) JOHN = WASHBURN  (file_line \r\n) 'JOHN'='WASHBURN' (file_line \r\n) "JOHN" = "WASHBURN" <EOF>)

First, you can see I have not even begun to test the interesting options (e.g. German names/values, quoted IDs that contain the = sign or the other quote chars, etc.). The test files which are all ignorable white space and/or comments parse as expected. The tree printed shows the end of line (EOL) logic seems to be on track. But the parsing of the assignment expression itself is where the recognition error occurs.

I am puzzled how the 4-character phrase JOHN (or the phrase WASHBURN) fails to match NON_QUOTED_VALUE, how 'JOHN' fails to match CHAR_SINGLEQUOTE, or how '=' or ' = ' fails to match the assignment rule.

I am sure it will be a DOH!! moment, but what have I missed here?

Marc Q. · Accepted Answer · 2015-02-18T11:30:25

The reason why the 4-character phrase JOHN is not recognized as a NON_QUOTED_VALUE token is that the whole line JOHN=WASHBURN is recognized as a single DOUBLE_QUOTED_VALUE token. Instrumenting your grammar with the following trace actions will show this (sorry, Java code, but I'm sure you can translate).

NON_QUOTED_VALUE : IDCHAR_COMMON+  {System.out.println("#A:"+getText());};
DOUBLE_QUOTED_VALUE : NON_QUOTED_VALUE 
          | (IDCHAR_COMMON |  CHAR_SINGLEQUOTE | EQUALS)+ {System.out.println("#B:"+getText());}
          ;
SINGLE_QUOTED_VALUE : NON_QUOTED_VALUE 
          | (IDCHAR_COMMON |  CHAR_DOUBLEQUOTE | EQUALS)+ {System.out.println("#C:"+getText());}
          ;

... produces the following output ...

#B:JOHN=WASHBURN
#B:JOHN = WASHBURN 
#B:'JOHN'='WASHBURN'
#C:"JOHN" = "WASHBURN"

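The reason for this is that the lexer runs independently of the parser and, at each position, picks the lexer rule that matches the longest run of input; when two rules match the same length, the one defined first wins. Because DOUBLE_QUOTED_VALUE and SINGLE_QUOTED_VALUE also accept = and quote characters, they swallow each line whole, so the parser never sees a separate NON_QUOTED_VALUE, ASSIGNMENT_OPERATOR, or quote token.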
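
As a rough illustration of that longest-match behaviour, here is a minimal Java sketch that mimics the rule selection with java.util.regex instead of ANTLR. The class name LongestMatchDemo, the tokenize helper, and the ASCII-only character classes are assumptions made for this example; they are not part of your grammar or of the ANTLR runtime, and EOL and comment handling are left out.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LongestMatchDemo {
    // ASCII subset of IDCHAR_COMMON from the question, as a regex class body
    // (assumption: the Unicode ranges are dropped to keep the sketch short).
    static final String ID = " !#-&(-<>-~";

    // Lexer rules in definition order; the order only matters for ties.
    static final Map<String, Pattern> RULES = new LinkedHashMap<>();
    static {
        RULES.put("EQUALS", Pattern.compile("="));
        RULES.put("ASSIGNMENT_OPERATOR", Pattern.compile(" *= *"));
        RULES.put("CHAR_SINGLEQUOTE", Pattern.compile("'"));
        RULES.put("CHAR_DOUBLEQUOTE", Pattern.compile("\""));
        RULES.put("NON_QUOTED_VALUE", Pattern.compile("[" + ID + "]+"));
        RULES.put("DOUBLE_QUOTED_VALUE", Pattern.compile("[" + ID + "'=]+"));
        RULES.put("SINGLE_QUOTED_VALUE", Pattern.compile("[" + ID + "\"=]+"));
    }

    // Splits one line into "RULE[text]" entries using longest match,
    // keeping the earlier rule when two rules match the same length.
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            String bestRule = null;
            int bestEnd = pos;
            for (Map.Entry<String, Pattern> rule : RULES.entrySet()) {
                Matcher m = rule.getValue().matcher(input);
                m.region(pos, input.length());
                // Only a strictly longer match replaces the current best.
                if (m.lookingAt() && m.end() > bestEnd) {
                    bestRule = rule.getKey();
                    bestEnd = m.end();
                }
            }
            if (bestRule == null) {
                throw new IllegalArgumentException("no rule matches at index " + pos);
            }
            tokens.add(bestRule + "[" + input.substring(pos, bestEnd) + "]");
            pos = bestEnd;
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("JOHN"));
        System.out.println(tokenize("JOHN=WASHBURN"));
        System.out.println(tokenize("\"JOHN\" = \"WASHBURN\""));
    }
}
```

On its own, JOHN comes out as NON_QUOTED_VALUE (the tie with the two quoted rules goes to the rule defined first). As soon as an = follows, the two quoted-value rules match further than any other rule, so the whole line becomes one token, which agrees with the #B and #C lines in the trace output.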

In case it helps, the following grammar should recognize your sample file.

CHAR_SINGLEQUOTE : ('\u0027' | '\u2018' | '\u2019') ;
CHAR_DOUBLEQUOTE : ('\u0022' | '\u201C' | '\u201D') ;
EQUALS : '=';
EOL : [\r\n]+ ;

IGNORED_WHITESPACE : 
       ( ' '
       | '\u0000' .. '\u0009'  // 7-bit control chars less than Line Feed
       | '\u000B'  | '\u000C'  // Vertical tab and Form feed
       | '\u000E' .. '\u001F'  // 7-bit control chars more than Carriage Return
       | '\u007F' .. '\u009F'  // 8-bit ASCII control characters and DEL
       )+
     -> channel(HIDDEN)
     ;

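fragment IDCHAR_COMMON : // keep this a fragment, otherwise a one-character value lexes as IDCHAR_COMMON instead of NON_QUOTED_VALUE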
         ( '\u0020'  | '\u0021'  // Space and bang (!)
         | '\u0023' .. '\u0026'  // # to & (skips ")
         | '\u0028' .. '\u003C'  // ( to < (skips ')
         | '\u003E' .. '\u007E'  // > to ~ (skips =)
         | '\u00A0' .. '\u2018'  // printable UNICODE code points below  Open Single Quote
         | '\u201A' .. '\u201B'  // printable UNICODE code points between Close Single Quote and Open Double Quote
         | '\u201E' .. '\uFFFF'  // printable UNICODE code points above Close Double Quote
         )
       ;
NON_QUOTED_VALUE : IDCHAR_COMMON+  {System.out.println("#A:"+getText());};

JAVADOC_COMMENT  
  : '/**' .*? '*/' [ \r\n]*
  -> channel(HIDDEN)
  ;
CSTYLE_COMMENT  
  : '/*'  .*? '*/'  [ \r\n]*
  -> channel(HIDDEN)
  ;


file : file_line* EOF ;

file_line 
  : assignment
  | EOL
  ;

assignment
  : identifier  EQUALS  identifier 
  ;
identifier : NON_QUOTED_VALUE 
           | CHAR_DOUBLEQUOTE (NON_QUOTED_VALUE |  CHAR_SINGLEQUOTE | EQUALS)+ CHAR_DOUBLEQUOTE 
           | CHAR_SINGLEQUOTE (NON_QUOTED_VALUE |  CHAR_DOUBLEQUOTE | EQUALS)+ CHAR_SINGLEQUOTE ;
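
To see why moving the quoting logic into the parser helps, here is a small Java sketch of the grammar above, again using java.util.regex rather than ANTLR. It is only an approximation under stated assumptions: the class RevisedGrammarDemo, the one-letter type codes, and the helpers tokenTypes and isAssignment are hypothetical names; the character classes are ASCII only; the character class is inlined into NON_QUOTED_VALUE; and the assignment rule is emulated with a regex over token-type codes, which is not how ANTLR parses.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RevisedGrammarDemo {
    // Lexer rules in definition order: {one-letter type code, regex}.
    // W stands for whitespace, which the grammar sends to the hidden channel.
    static final String[][] RULES = {
        {"S", "'"},               // CHAR_SINGLEQUOTE (ASCII only in this sketch)
        {"D", "\""},              // CHAR_DOUBLEQUOTE (ASCII only in this sketch)
        {"E", "="},               // EQUALS
        {"W", " +"},              // IGNORED_WHITESPACE (space only in this sketch)
        {"N", "[ !#-&(-<>-~]+"},  // NON_QUOTED_VALUE (ASCII subset of IDCHAR_COMMON)
    };

    // Returns the visible token types of one line as a string of type codes.
    static String tokenTypes(String line) {
        StringBuilder types = new StringBuilder();
        int pos = 0;
        while (pos < line.length()) {
            String best = null;
            int bestEnd = pos;
            for (String[] rule : RULES) {
                Matcher m = Pattern.compile(rule[1]).matcher(line);
                m.region(pos, line.length());
                // Longest match wins; on a tie the earlier rule is kept.
                if (m.lookingAt() && m.end() > bestEnd) {
                    best = rule[0];
                    bestEnd = m.end();
                }
            }
            if (best == null) {
                throw new IllegalArgumentException("no rule matches at index " + pos);
            }
            if (!best.equals("W")) {
                types.append(best); // hidden-channel tokens never reach the parser
            }
            pos = bestEnd;
        }
        return types.toString();
    }

    // identifier : N | D (N | S | E)+ D | S (N | D | E)+ S
    static final String IDENT = "(?:N|D[NSE]+D|S[NDE]+S)";
    // assignment : identifier EQUALS identifier
    static final Pattern ASSIGNMENT = Pattern.compile(IDENT + "E" + IDENT);

    static boolean isAssignment(String line) {
        return ASSIGNMENT.matcher(tokenTypes(line)).matches();
    }

    public static void main(String[] args) {
        String[] lines = {
            "JOHN=WASHBURN", " JOHN = WASHBURN ",
            "'JO\"HN'='WASHBURN'", "\"JO='HN\" = \"WASHBURN\"", "JOHN"
        };
        for (String line : lines) {
            System.out.println(tokenTypes(line) + " -> " + isAssignment(line));
        }
    }
}
```

Because no single lexer rule can span an = or a quote character any more, each line breaks into small tokens, and the choice between plain and quoted identifiers is made from the token sequence rather than inside the lexer.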

This should also parse the following lines, which, judging from your grammar, I assume you consider valid.

'JO"HN'='WASHBURN'
"JO='HN" = "WASHBURN"
