I'm a complete noob with ANTLR, so apologies if this is a really basic question.
I'm trying to parse a file that has a weird JSON-like syntax. These files are huge, hundreds of MB, so I'm avoiding creating the parse tree and I'm just using grammar actions to manipulate the data into what I want.
As usual, I'm sending Whitespaces and Newlines to the HIDDEN channel. However, there are a couple cases where it'd be helpful if I could detect that the next character is one of those, because that delimits the property value. Here's an excerpt from a file
game_speed=4
mapmode=0
dyn_title=
{
title="e_dyn_188785"
nick=nick_the_just hist=yes
base_title="k_mongolia"
is_custom=yes
is_dynamic=yes
claim=
{
title=k_bulgaria
pressed=yes
weak=yes
}
claim=
{
title=c_karvuna
pressed=yes
}
claim=
{
title=c_tyrnovo
}
claim=
{
title=c_mesembria
pressed=yes
}
}
And here's the relevant parts of my grammar:
property: key ASSIGNMENT value { insertProp(stack[scopeLevel], $key.text, currentVal) };
key: (LOWERCASE | UPPERCASE | UNDERSCORE | DIGIT | DOT | bool)+;
value:
bool { currentVal = $bool.text === 'yes' }
| string { currentVal = $string.text.replace(/\"/gi, '') }
| number { currentVal = parseFloat($number.text, 10) }
| date { currentVal = $date.text }
| specific_value { currentVal = $specific_value.text }
| (numberArray { currentVal = toArray($numberArray.text) }| array)
| object
;
bool: 'yes' | 'no';
number: DASH? (DIGIT+ | (DIGIT+ '.' DIGIT+));
string:
'"'
( number
| bool
| specific_value
| NONALPLHA
| UNDERSCORE
| DOT
| OPEN_CURLY_BRACES
| CLOSE_CURLY_BRACES
)*
'"'
;
specific_value: (LOWERCASE | UPPERCASE | UNDERSCORE | DASH | bool)+ ;
WS: ([\t\r\n] | ' ') -> channel(HIDDEN);
NEWLINE: ( '\r'? '\n' | '\r')+ -> channel(HIDDEN);
So, as you can see, the input syntax can have property values that are strings but are not delimited by "
. And, in fact, for some odd reason, sometimes the next property appears on the same line. Ignoring the WS and NEWLINE means that the parser doesn't recognise that specific_value
rule terminates so it grabs part of the next key as well. See output example below:
{
game_speed: 4,
mapmode: 0,
dyn_title:
{
title: 'e_dyn_188785',
nick: 'nick_the_just\t\t\this',
t: true,
base_title: 'k_mongolia',
is_custom: true,
is_dynamic: true,
claim: { title: 'k_bulgaria\n\t\t\t\tpresse', d: true, weak: true },
claim2: { title: 'c_karvuna\n\t\t\t\tpresse', d: true },
claim3: { title: 'c_tyrnovo' },
claim4: { title: 'c_mesembria\n\t\t\t\tpresse', d: true
}
},
What's an appropriate solution here to specify that specific_value
shouldn't grab any characters once it reaches a WS or NEWLINE?
Thanks in advance! :D
specific_value
-characters (andstring
-chars, andnumber
-characters) in the parser, you have trouble of determining ifa b
is a singlespecific_value
(ab
) or 2specific_value
s (a
andb
) because the space in between is put on a different channel by the lexer. Handling such cases in the lexer would eliminate this. I'll write a quick demo. – Bart Kiers