ANTLR4 - using hidden Tokens in parser rules

Question

I'm a complete noob with ANTLR, so apologies if this is a really basic question.

I'm trying to parse a file that has a weird JSON-like syntax. These files are huge, hundreds of MB, so I'm avoiding creating the parse tree and I'm just using grammar actions to manipulate the data into what I want.

As usual, I'm sending whitespace and newlines to the HIDDEN channel. However, there are a couple of cases where it'd be helpful if I could detect that the next character is one of those, because that delimits the property value. Here's an excerpt from a file:

  game_speed=4
  mapmode=0
  dyn_title=
  {
    title="e_dyn_188785"
    nick=nick_the_just          hist=yes
    base_title="k_mongolia"
    is_custom=yes
    is_dynamic=yes
    claim=
    {
      title=k_bulgaria
      pressed=yes
      weak=yes
    }
    claim=
    {
      title=c_karvuna
      pressed=yes
    }
    claim=
    {
      title=c_tyrnovo
    }
    claim=
    {
      title=c_mesembria
      pressed=yes
    }
  }

And here's the relevant parts of my grammar:

property: key ASSIGNMENT value { insertProp(stack[scopeLevel], $key.text, currentVal) };

key: (LOWERCASE | UPPERCASE | UNDERSCORE | DIGIT | DOT | bool)+;
value: 
  bool { currentVal = $bool.text === 'yes' } 
  | string { currentVal = $string.text.replace(/\"/gi, '') } 
  | number { currentVal = parseFloat($number.text) } 
  | date { currentVal = $date.text }
  | specific_value { currentVal = $specific_value.text }
  | (numberArray { currentVal = toArray($numberArray.text) }| array)
  | object
  ;

bool: 'yes' | 'no';
number: DASH? (DIGIT+ | (DIGIT+ '.' DIGIT+));
string:
  '"' 
  ( number
    | bool
    | specific_value 
    | NONALPLHA 
    | UNDERSCORE 
    | DOT 
    | OPEN_CURLY_BRACES 
    | CLOSE_CURLY_BRACES 
  )* 
  '"'
  ;

specific_value: (LOWERCASE | UPPERCASE | UNDERSCORE | DASH | bool)+ ;


WS: ([\t\r\n] | ' ') -> channel(HIDDEN);
NEWLINE: ( '\r'? '\n' | '\r')+ -> channel(HIDDEN);

So, as you can see, the input syntax can have property values that are strings but are not delimited by ". And, in fact, for some odd reason, sometimes the next property appears on the same line. Ignoring WS and NEWLINE means the parser doesn't recognise where the specific_value rule should end, so it grabs part of the next key as well. See the output example below:

{
  game_speed: 4,
  mapmode: 0,
  dyn_title:
  { 
     title: 'e_dyn_188785',
     nick: 'nick_the_just\t\t\this',
     t: true,
     base_title: 'k_mongolia',
     is_custom: true,
     is_dynamic: true,
     claim: { title: 'k_bulgaria\n\t\t\t\tpresse', d: true, weak: true },
     claim2: { title: 'c_karvuna\n\t\t\t\tpresse', d: true },
     claim3: { title: 'c_tyrnovo' },
     claim4: { title: 'c_mesembria\n\t\t\t\tpresse', d: true 
  } 
},

What's an appropriate solution here to specify that specific_value shouldn't grab any characters once it reaches a WS or NEWLINE?

Thanks in advance! :D

Why do you not handle strings and numbers at the lexer stage? — Bart Kiers
oversight? inexperience? just didn't think to do it that way? I'm not really seeing how that would help though? — tiansivive
By gluing together specific_value characters (and string characters, and number characters) in the parser, you have trouble determining whether "a b" is a single specific_value (ab) or two specific_values (a and b), because the space in between is put on a different channel by the lexer. Handling such cases in the lexer would eliminate this. I'll write a quick demo. — Bart Kiers
aha! yes, that makes sense, I'll try it out! thanks! I'd still very much appreciate the short demo if you could :) — tiansivive

Bart Kiers · Accepted Answer · 2019-03-13T10:52:26

I'd handle as much as possible in the lexer (like identifiers, numbers and strings). That could look like this in your case:

grammar JsonLike;

parse
 : object? EOF
 ;

object
 : '{' key_value* '}'
 ;

key_value
 : key '=' value
 ;

key
 : SPECIFIC_VALUE
 | BOOL
 // More tokens that can be a key?
 ;

value
 : object
 | array
 | BOOL
 | STRING
 | NUMBER
 | SPECIFIC_VALUE
 ;

array
 : '[' value+ ']'
 ;

BOOL
 : 'yes'
 | 'no'
 ;

STRING
 : '"' ( ~["\\] | '\\' ["\\] )* '"'
 ;

NUMBER
 : '-'? [0-9]+ ( '.' [0-9]+ )?
 ;

SPECIFIC_VALUE
 : [a-zA-Z_] [a-zA-Z_0-9]*
 ;

SPACES
 : [ \t\r\n]+ -> channel(HIDDEN)
 ;

Resulting in the following parse: (parse tree image not preserved)

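To illustrate why moving identifiers, numbers and strings into the lexer fixes the run-together values from the question, here is a rough sketch in plain JavaScript. It is not the ANTLR runtime and not the code ANTLR generates; it only imitates the lexer rules from the grammar above with regular expressions. The names RULES, tokenize and toValue, and the PUNCT token type (standing in for the literal '=', '{', '}', '[' and ']' tokens), are made up for this example. Like ANTLR's lexer, it takes the longest match and, on a tie, the rule listed first.

```javascript
// Hypothetical stand-ins for the lexer rules in the grammar above.
// The sticky flag (y) makes each regex match only at `lastIndex`.
const RULES = [
  ['BOOL', /(?:yes|no)/y],
  ['STRING', /"(?:[^"\\]|\\["\\])*"/y],
  ['NUMBER', /-?[0-9]+(?:\.[0-9]+)?/y],
  ['SPECIFIC_VALUE', /[a-zA-Z_][a-zA-Z_0-9]*/y],
  ['SPACES', /[ \t\r\n]+/y],
  ['PUNCT', /[={}\[\]]/y], // assumption: the grammar's implicit literal tokens
];

function tokenize(input) {
  const tokens = [];
  let pos = 0;
  while (pos < input.length) {
    let best = null;
    for (const [type, re] of RULES) {
      re.lastIndex = pos;
      const m = re.exec(input);
      // Longest match wins; on a tie the earlier rule is kept,
      // so 'yes' is a BOOL but 'yes_man' is a SPECIFIC_VALUE.
      if (m && (!best || m[0].length > best.text.length)) {
        best = { type, text: m[0] };
      }
    }
    if (!best) throw new Error(`Unexpected character at offset ${pos}`);
    pos += best.text.length;
    // Whitespace still ends the previous token; it is just not
    // passed on to the parser (the HIDDEN channel in ANTLR terms).
    if (best.type !== 'SPACES') tokens.push(best);
  }
  return tokens;
}

// Converts a token to a JavaScript value, roughly what the grammar
// actions in the question do. Escape sequences inside strings are
// left as-is in this sketch.
function toValue({ type, text }) {
  switch (type) {
    case 'BOOL': return text === 'yes';
    case 'NUMBER': return parseFloat(text);
    case 'STRING': return text.slice(1, -1);
    default: return text;
  }
}

const demo = tokenize('nick=nick_the_just\t\t\thist=yes');
console.log(demo.map((t) => t.text).join(' | '));
// prints: nick | = | nick_the_just | hist | = | yes
```

Because the tabs end the SPECIFIC_VALUE token before they are discarded, nick_the_just and hist arrive at the parser as two separate tokens, which is the behaviour the parser-level specific_value rule could not provide.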