I am writing an ANTLR Lexer and Parser grammar that will parse text that is quite similar to a Java class. Eventually it will parse text like the following:
reference schema:"https://schema.org/";
reference dc:"https://www.dublincore.org/";
type dc:Author {
}
I am building up the Lexer and Parser slowly. I have successfully managed to parse the reference
s but have hit a wall when parsing the type
.
Before adding support for the type
I was able to use string literals for space, colon, and semi-colon in the parser but after I encountered cannot create implicit token for string literal
errors. I defined a lexer rule for each of those characters and replaced all occurrences of the literal with the rule. However this broke the parsing of reference
s.
I have included my lexer and parser that successfully parses reference
s below (along with a sample input and the parsed abstract syntax tree) and the evolved versions which isn't working. I am not getting any compilation errors but plenty of token recognition error
s (screenshot included below).
What is the correct way to handle the parsing?
Working
Lexer
lexer grammar WorkingLexerGrammar;
WS: ('\t' | '\n' | '\r' )+ -> skip ;
fragment Colon : ':';
fragment SemiColon: ';';
fragment Underscores: '_'+ ;
fragment Digits: [0-9]+ ;
fragment LowercaseLetters: [a-z]+ ;
fragment UppercaseLetters: [A-Z]+ ;
fragment String: '"' .*? '"' ;
fragment Prefix: (Underscores | Digits | LowercaseLetters)+ ;
REFERENCE_KEYWORD: 'reference' ;
TYPE_KEYWORD: 'type' ;
PREFIXED_REFERENCE: ' ' -> pushMode(PrefixedReferenceMode) ;
mode PrefixedReferenceMode;
REFERENCE_PREFIX: Prefix;
REFERENCE_PREFIX_SEPARATOR: ':' -> pushMode(IriMode);
END_IRI: ';' -> popMode;
mode IriMode;
IRI: String -> popMode;
Parser
parser grammar WorkingParserGrammar ;
options { tokenVocab=WorkingLexerGrammar; }
document: reference* EOF ;
prefixedReference: REFERENCE_PREFIX ':' IRI;
reference: REFERENCE_KEYWORD ' ' prefixedReference ';';
Input
reference schema:"https://schema.org/";
reference dc:"https://www.dublincore.org/";
Output
Evolved (not working)
Lexer
lexer grammar NotWorkingLexerGrammar;
WS: ('\t' | '\n' | '\r' )+ -> skip ;
fragment Colon : ':';
fragment SemiColon: ';';
fragment Underscores: '_'+ ;
fragment Digits: [0-9]+ ;
fragment LowercaseLetters: [a-z]+ ;
fragment UppercaseLetters: [A-Z]+ ;
fragment String: '"' .*? '"' ;
fragment Prefix: (Underscores | Digits | LowercaseLetters)+ ;
COLON: Colon;
SEMICOLON: SemiColon;
SPACE: ' ';
REFERENCE_KEYWORD: 'reference' ;
TYPE_KEYWORD: 'type' ;
PREFIXED_REFERENCE: SPACE -> pushMode(PrefixedReferenceMode) ;
mode PrefixedReferenceMode;
REFERENCE_PREFIX: Prefix;
REFERENCE_PREFIX_SEPARATOR: COLON -> pushMode(IriMode);
END_IRI: SEMICOLON -> popMode;
mode IriMode;
IRI: String -> popMode;
PREFIXED_NAME: SPACE -> pushMode(PrefixedNameMode) ;
mode PrefixedNameMode;
NAME_PREFIX: Prefix;
NAME_PREFIX_SEPARATOR: COLON -> pushMode(LocalNameMode);
END_NAME: SEMICOLON -> popMode;
mode LocalNameMode;
LOCAL_NAME: (Underscores | Digits | LowercaseLetters | UppercaseLetters)+ -> popMode;
Parser
parser grammar NotWorkingParserGrammar ;
options { tokenVocab=NotWorkingLexerGrammar; }
document: reference* type* EOF ;
prefixedReference: REFERENCE_PREFIX COLON IRI;
reference: REFERENCE_KEYWORD SPACE prefixedReference SEMICOLON;
prefixedName: NAME_PREFIX SPACE LOCAL_NAME;
type: TYPE_KEYWORD SPACE prefixedName;
Output
Following Bart Kiers' help I have made two updates to the lexer and parser grammars with varying success.
First update
This change parses the type definition correctly but only if I remove the lexer rules for reference. I think the reason for that is that the two rules are the same (i.e. PREFIXED_REFERENCE: SPACE -> pushMode(PrefixedReferenceMode) ;
for reference and PREFIXED_NAME: SPACE -> pushMode(PrefixedNameMode) ;
for type) – that is they both match on a space. My second update attempts to fix this but the full lexer and parser grammars are below.
Lexer
lexer grammar NotWorkingLexerGrammar;
WS: ('\t' | '\n' | '\r' )+ -> skip ;
fragment Underscores: '_'+ ;
fragment Digits: [0-9]+ ;
fragment LowercaseLetters: [a-z]+ ;
fragment UppercaseLetters: [A-Z]+ ;
fragment String: '"' .*? '"' ;
fragment Prefix: (Underscores | Digits | LowercaseLetters)+ ;
fragment COLON: ':';
fragment SEMICOLON: ';';
fragment SPACE: ' ';
fragment REFERENCE_KEYWORD: 'reference' ;
fragment TYPE_KEYWORD: 'type' ;
PREFIXED_REFERENCE: SPACE -> pushMode(PrefixedReferenceMode) ;
mode PrefixedReferenceMode;
REFERENCE_PREFIX: Prefix;
REFERENCE_PREFIX_SEPARATOR: COLON -> pushMode(IriMode);
END_IRI: SEMICOLON -> popMode;
mode IriMode;
IRI: String -> popMode;
PREFIXED_NAME: SPACE -> pushMode(PrefixedNameMode) ;
mode PrefixedNameMode;
NAME_PREFIX: Prefix;
NAME_PREFIX_SEPARATOR: COLON -> pushMode(LocalNameMode);
END_NAME: SEMICOLON -> popMode;
mode LocalNameMode;
LOCAL_NAME: (Underscores | Digits | LowercaseLetters | UppercaseLetters)+ -> popMode;
Parser
parser grammar NotWorkingParserGrammar ;
options { tokenVocab=NotWorkingLexerGrammar; }
document: reference* type* EOF ;
prefixedReference: REFERENCE_PREFIX REFERENCE_PREFIX_SEPARATOR IRI;
reference: REFERENCE_KEYWORD PREFIXED_REFERENCE prefixedReference END_IRI;
prefixedName: NAME_PREFIX NAME_PREFIX_SEPARATOR LOCAL_NAME;
type: TYPE_KEYWORD PREFIXED_NAME prefixedName END_NAME;
Second update
In an attempt to fix this I moved the reference
and type
keywords to the Lexer rules for the corresponding parts but this only parses the type if I remove all of the Lexer rules for reference. However references are parsed correctly.
Lexer
lexer grammar NotWorkingLexerGrammar;
WS: ('\t' | '\n' | '\r' )+ -> skip ;
fragment Underscores: '_'+ ;
fragment Digits: [0-9]+ ;
fragment LowercaseLetters: [a-z]+ ;
fragment UppercaseLetters: [A-Z]+ ;
fragment String: '"' .*? '"' ;
fragment Prefix: (Underscores | Digits | LowercaseLetters)+ ;
fragment COLON: ':';
fragment SEMICOLON: ';';
fragment SPACE: ' ';
fragment REFERENCE_KEYWORD: 'reference' ;
fragment TYPE_KEYWORD: 'type' ;
PREFIXED_REFERENCE: REFERENCE_KEYWORD SPACE -> pushMode(PrefixedReferenceMode) ;
mode PrefixedReferenceMode;
REFERENCE_PREFIX: Prefix;
REFERENCE_PREFIX_SEPARATOR: COLON -> pushMode(IriMode);
END_IRI: SEMICOLON -> popMode;
mode IriMode;
IRI: String -> popMode;
TYPE_DEFINITION: TYPE_KEYWORD SPACE -> pushMode(PrefixedNameMode) ;
mode PrefixedNameMode;
NAME_PREFIX: Prefix;
NAME_PREFIX_SEPARATOR: COLON -> pushMode(LocalNameMode);
END_NAME: SEMICOLON -> popMode;
mode LocalNameMode;
LOCAL_NAME: (Underscores | Digits | LowercaseLetters | UppercaseLetters)+ -> popMode;
Parser
parser grammar NotWorkingParserGrammar ;
options { tokenVocab=NotWorkingLexerGrammar; }
document: reference* type* EOF ;
prefixedReference: REFERENCE_PREFIX REFERENCE_PREFIX_SEPARATOR IRI;
reference: PREFIXED_REFERENCE prefixedReference END_IRI;
prefixedName: NAME_PREFIX NAME_PREFIX_SEPARATOR LOCAL_NAME;
type: TYPE_DEFINITION prefixedName END_NAME;
Output
For the following input:
reference schema:"https://schema.org/";
reference dc:"https://www.dublincore.org/";
type dc:Author;
This is the output:
line 4:0 token recognition error at: 't'
line 4:1 token recognition error at: 'y'
line 4:2 token recognition error at: 'p'
line 4:3 token recognition error at: 'e'
line 4:4 token recognition error at: ' '
line 4:5 token recognition error at: 'd'
line 4:6 token recognition error at: 'c'
line 4:7 token recognition error at: ':'
line 4:8 token recognition error at: 'A'
line 4:9 token recognition error at: 'u'
line 4:10 token recognition error at: 't'
line 4:11 token recognition error at: 'h'
line 4:12 token recognition error at: 'o'
line 4:13 token recognition error at: 'r;'
My reasoning for using modes is to limit the scope of rules. This is a language I control but would prefer not to change it dramatically. There is much more to the language than I've shown here and we have already have a grammar (currently a combined grammar) but it is quite brittle. I tried to make a change to prevent uppercase characters in prefixes but permit them in the local name but this snowballed and other rules started applying. Research suggested that modes was an approach to handle this situation but I'm not very familiar with ANTLR so I've possibly misunderstood it.