I am currently working on an island grammar parser for two programming languages (DSLs) that appear in the same file. Statements of the second programming language always start with a special character (*), but they can take two forms: inline or multi-line.
An inline statement starts with * and ends at a newline (\r? \n); it may optionally have a semicolon before the newline.
A multi-line statement starts with *, may span any number of lines, and ends at a semicolon.
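To pin down the result I expect, independently of ANTLR, here is a minimal Python sketch of the statement boundaries. It is only a reference for the intended split, not a lexer implementation. The function name split_statements and the labels 'first', 'inline' and 'multiline' are made up for illustration. It also assumes a disambiguation rule that I have not stated above: a statement ends at the first semicolon found before the next * (or end of input), and otherwise at the first newline.

```python
import re
from typing import List, Tuple

WORD = re.compile(r"[a-z]+")  # same shape as the ID rule: lowercase words only


def split_statements(text: str) -> List[Tuple[str, str]]:
    """Split text into ('first', word), ('inline', body) and ('multiline', body) items.

    Assumed rule (not confirmed by any spec): a '*' statement ends at the first
    ';' that occurs before the next '*', otherwise at the first newline.
    """
    items: List[Tuple[str, str]] = []
    pos, n = 0, len(text)
    while pos < n:
        if text[pos] == "*":
            next_star = text.find("*", pos + 1)
            limit = n if next_star == -1 else next_star
            semi = text.find(";", pos + 1, limit)
            newline = text.find("\n", pos + 1, limit)
            if semi != -1:
                # Terminated by a semicolon: multi-line only if a newline is inside.
                body, end = text[pos + 1:semi], semi + 1
                kind = "multiline" if "\n" in body else "inline"
            else:
                # No semicolon before the next statement: ends at the newline.
                end = limit if newline == -1 else newline + 1
                body, kind = text[pos + 1:end], "inline"
            items.append((kind, " ".join(body.split())))  # normalize whitespace
            pos = end
        else:
            match = WORD.match(text, pos)
            if match:
                items.append(("first", match.group()))
                pos = match.end()
            else:
                pos += 1  # skip whitespace and any other character
    return items
```

For instance, under these assumptions split_statements("*two inline\n*three expands\nto next line;") yields one inline item and one multiline item, which is the distinction I want the lexer and parser to make.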
I am having difficulty using ANTLR4's lexer modes to accomplish this. Can somebody point me in the right direction?
My grammar is given below. The parser reports two errors for the following example:
line 5:21 extraneous input '\n' expecting {<EOF>, ID, SWITCH_CHAR}
line 8:31 extraneous input '\n' expecting {<EOF>, ID, SWITCH_CHAR}
Example:
first programming language
*example one second programming language inline statement ending with semicolon;
*example two another valid second programming language inline statement ending with newline
*example three second programming language may expand to the next line
until semicolon char;
*example four second programming language example may expand
to a number of lines
too ending with semicolon char;
first programming language again
Lexer:
lexer grammar ComplexLanguageLexer;
/*** SEA ****/
ID: [a-z]+;
WS: [ \t\f]+ -> skip;
SWITCH_CHAR: '*' -> pushMode(inline_mode), pushMode(multiline_mode);
NEWLINE: '\r'? '\n' -> skip;
/***ISLANDS****/
mode multiline_mode;
MULTILINE_SWITCH_CHAR: ';' -> popMode; //seek until ';'
MULTILINE_ID: [a-z]+;
MULTILINE_WS: [ \t\f]+ -> skip;
MULTILINE_NEWLINE: '\r'? '\n' -> skip; //just skip newlines in the multiline mode
mode inline_mode;
INLINE_NEWLINE: '\r'? '\n' -> type(NEWLINE), popMode;
INLINE_SEMICOLONCHAR: ';' ; //just match semicolonchar
INLINE_ID: [a-z]+;
INLINE_WS: [ \t\f]+ -> skip;
Grammar:
parser grammar ComplexLanguageParser;
options { tokenVocab = ComplexLanguageLexer ; }
startRule: programStatement+;
programStatement:
word | inlineStatement| multilineStatement
;
word: ID;
inlineStatement:
SWITCH_CHAR INLINE_ID+ INLINE_SEMICOLONCHAR? NEWLINE
;
multilineStatement:
SWITCH_CHAR MULTILINE_ID+ MULTILINE_SWITCH_CHAR
;
Update
I've updated the lexer/parser grammar following @GRosenberg's instructions:
Lexer:
lexer grammar ComplexLanguageLexer;
SWITCH_CHAR: STAR -> pushMode(second_mode) ;
ID1 : ID ;
WS1 : WS -> skip ;
NL1 : NL -> skip ;
fragment STAR : '*' ;
fragment ID : [a-z]+ ;
fragment WS : [ \t\f]+ ;
fragment NL : '\r'? '\n' ;
mode second_mode ;
TERM1 : ( WS | NL )* SEMI -> popMode ;
TERM2 : WS* NL -> popMode ;
ID2 : ID ;
WS2 : WS+ -> skip ;
NL2 : NL;
SEMI : ';';
Grammar:
parser grammar ComplexLanguageParser;
options { tokenVocab = ComplexLanguageLexer ; }
startRule: programStatement+;
programStatement: firstLanguageStatement | secondLanguageStatment ;
firstLanguageStatement: word ;
secondLanguageStatment: SWITCH_CHAR (inlineStatement| multilineStatement) ;
word: ID1;
multilineStatement: (ID2|NL2)+ TERM1;
inlineStatement: ID2+ TERM2;
This works as intended for inline statements, but it still fails for multi-line statements whose text continues on the next line. I am not sure what I am doing wrong here.
For example:
first language -> ok
*second language inline -> ok
*multi line; -> ok
*multi line expands to
next line; -> token recognition error at ';'
*multi line
; -> ok
first language again -> ok