Parsing inline and multi-line statements using antlr4's lexer modes

Question

I am currently working on an island grammar parser for two programming languages (DSLs) that share the same file. Statements of the second language always start with a special character (*), but they can take two forms: inline or multi-line.

An inline statement starts with * and terminates at a newline (\r?\n).

A multi-line statement starts with * and may span several lines, ending with a semicolon.

I am having difficulty using ANTLR4's lexer modes to accomplish this. Can somebody point me in the right direction?

My grammar is given below. The parser reports two errors for the following example:

line 5:21 extraneous input '\n' expecting {<EOF>, ID, SWITCH_CHAR}
line 8:31 extraneous input '\n' expecting {<EOF>, ID, SWITCH_CHAR}

Example:

first programming language 
*example one second programming language inline statement ending with semicolon;
*example two another valid second programming language inline statement ending with newline
*example three second programming language may expand to the next line
until semicolon char;
*example four second programming language example may expand 
to a number of lines
too ending with semicolon char;
first programming language again

Lexer:

lexer grammar ComplexLanguageLexer;
/*** SEA ****/
ID: [a-z]+;
WS: [ \t\f]+ -> skip;
SWITCH_CHAR: '*' -> pushMode(inline_mode), pushMode(multiline_mode);
NEWLINE:  '\r'? '\n' -> skip;

/***ISLANDS****/
mode multiline_mode;
MULTILINE_SWITCH_CHAR: ';' -> popMode;  //seek until ';'
MULTILINE_ID: [a-z]+;
MULTILINE_WS: [ \t\f]+ -> skip;
MULTILINE_NEWLINE:  '\r'? '\n' -> skip; //just skip newlines in the multiline mode

mode inline_mode;
INLINE_NEWLINE:  '\r'? '\n' -> type(NEWLINE), popMode;
INLINE_SEMICOLONCHAR: ';' ; //just match semicolonchar
INLINE_ID: [a-z]+;
INLINE_WS: [ \t\f]+ -> skip;

Grammar:

parser grammar ComplexLanguageParser;
options { tokenVocab = ComplexLanguageLexer ; }

startRule:   programStatement+;

programStatement:
    word | inlineStatement| multilineStatement
;

word: ID;

inlineStatement:
    SWITCH_CHAR INLINE_ID+ INLINE_SEMICOLONCHAR? NEWLINE
;

multilineStatement:
    SWITCH_CHAR MULTILINE_ID+ MULTILINE_SWITCH_CHAR
;

Update

I've updated the lexer/parser grammar following @GRosenberg's instructions:

Lexer

lexer grammar ComplexLanguageLexer;

SWITCH_CHAR: STAR -> pushMode(second_mode) ;
ID1         : ID ;
WS1         : WS -> skip ;
NL1         : NL -> skip ;

fragment STAR : '*' ;
fragment ID   : [a-z]+ ;
fragment WS   : [ \t\f]+ ;
fragment NL   : '\r'? '\n' ;

mode second_mode ;
    TERM1 : ( WS | NL )* SEMI -> popMode ;
    TERM2 : WS* NL -> popMode ;
    ID2   : ID ;
    WS2   : WS+ -> skip ;
    NL2   : NL;
    SEMI : ';';

Grammar

parser grammar ComplexLanguageParser;
options { tokenVocab = ComplexLanguageLexer ; }

startRule:  programStatement+;
programStatement:   firstLanguageStatement | secondLanguageStatment ;
firstLanguageStatement:    word ;
secondLanguageStatment:    SWITCH_CHAR (inlineStatement| multilineStatement)     ;
word: ID1;
multilineStatement:    (ID2|NL2)+ TERM1;
inlineStatement:   ID2+ TERM2;

It works as intended for the inline statements, but it still fails for multi-line statements that continue onto the next line. What am I doing wrong here?

e.g.

first language            -> ok
*second language inline   -> ok 
*multi line;              -> ok
*multi line expands to 
 next line;                ->  token recognition error at ';'
*multi line
;                          -> ok
first language again       -> ok

GRosenberg · Accepted Answer · 2016-08-10T18:56:31

The pushMode and popMode commands are implemented using a single stack. So, the rule

SWITCH_CHAR: '*' -> pushMode(inline_mode), pushMode(multiline_mode);

results in the lexer evaluating the multiline_mode rules first, because that mode was pushed last. On the first pop, the lexer switches to the inline_mode rules rather than returning to the default mode. That is unlikely to be what you want.
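The stack behaviour described above can be modelled in a few lines of Python. This is only an illustrative sketch: the ModeStack class and its method names are made up for this example and are not part of the ANTLR runtime API.

```python
# Minimal model of a lexer mode stack (illustrative only, not the ANTLR API).
class ModeStack:
    def __init__(self):
        self.current = "DEFAULT_MODE"
        self._saved = []              # modes to return to on a pop

    def push_mode(self, mode):
        self._saved.append(self.current)
        self.current = mode

    def pop_mode(self):
        self.current = self._saved.pop()
        return self.current


modes = ModeStack()
# SWITCH_CHAR: '*' -> pushMode(inline_mode), pushMode(multiline_mode);
modes.push_mode("inline_mode")
modes.push_mode("multiline_mode")

after_star = modes.current           # rules tried right after '*'
after_first_pop = modes.pop_mode()   # mode after ';' triggers popMode
after_second_pop = modes.pop_mode()  # mode after the inline newline pops again
print(after_star, after_first_pop, after_second_pop)
```

In this model the lexer only gets back to the default mode after both a semicolon and a newline have been seen, which matches neither statement form on its own.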

It is better to implement a single lexer mode that correctly handles all of the second-language statements. The basic idea is:

SWITCH_CHAR : STAR -> pushMode(second_mode) ;

mode second_mode ;
    STMT1 : ( ID | WS | NL )+ SEMI -> popMode ;  // multi-line form, ends at ';'
    STMT2 : ( ID | WS )+ NL -> popMode ;         // inline form, ends at newline

Untested, but should work provided ID does not include either STAR or SEMI.
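To see why the two rules can coexist, recall that the lexer picks the rule with the longest match. The following Python sketch approximates STMT1 and STMT2 with regular expressions; the regex translations and the second_mode_token helper are assumptions made for illustration, not generated ANTLR code.

```python
import re

# Hypothetical regex counterparts of the two rules (ID = [a-z]+, WS = [ \t\f]+).
STMT1 = re.compile(r"(?:[a-z \t\f]|\r?\n)+;")   # ( ID | WS | NL )+ SEMI
STMT2 = re.compile(r"[a-z \t\f]+\r?\n")         # ( ID | WS )+ NL


def second_mode_token(text, pos):
    """Return (rule_name, lexeme) for the longest match at pos, or None."""
    best = None
    for name, rule in (("STMT1", STMT1), ("STMT2", STMT2)):
        m = rule.match(text, pos)
        if m and (best is None or len(m.group()) > len(best[1])):
            best = (name, m.group())
    return best


text = "*inline statement\n*spans two\nlines;\n"
first = second_mode_token(text, 1)                 # just after the first '*'
second_start = text.index("*", 1) + 1              # just after the second '*'
second = second_mode_token(text, second_start)
print(first)
print(second)
```

For the first statement only STMT2 can match, because the next '*' stops STMT1 before it reaches a semicolon. For the second statement both rules match, and the longer STMT1 match wins.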

Update

To expose ID to the parser, just break it out of the statement rules:

SWITCH_CHAR: STAR -> pushMode(second_mode) ;
ID1         : ID ;
WS1         : WS -> skip ;
NL1         : NL -> skip ;

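fragment STAR : '*' ;
fragment ID   : [a-z]+ ;
fragment WS   : [ \t\f]+ ;
fragment NL   : '\r'? '\n' ;
fragment SEMI : ';' ;        // referenced by TERM1 below

mode second_mode ;
    TERM1 : ( WS | NL )* SEMI -> popMode ;  // multi-line terminator
    TERM2 : WS* NL -> popMode ;             // inline terminator; WS* also accepts a newline directly after an ID
    ID2   : ID ;
    WS2   : WS+ -> skip ;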

This, however, allows an ambiguity:

 *example two inline statement ending with newline
 first programming language again (including a semicolon)

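If this is valid input, there is simply not enough structure to tell an inline statement followed by first-language text apart from a single multi-line statement without using native code (for example, a semantic predicate or lexer action written in the target language).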
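The ambiguity can be made concrete with a small Python sketch that lists every way the statement could end. The candidate_readings helper is hypothetical, and the sketch assumes, as in the example above, that first-language text is allowed to contain a semicolon.

```python
def candidate_readings(text):
    """Return every way the statement starting at text[0] == '*' could end."""
    if not text.startswith("*"):
        raise ValueError("expected a second-language statement")
    readings = {}
    newline = text.find("\n")
    if newline != -1:
        readings["inline"] = text[1:newline]       # ends at the first newline
    semicolon = text.find(";")
    if semicolon != -1:
        readings["multiline"] = text[1:semicolon]  # ends at the first semicolon
    return readings


source = (
    "*example two inline statement ending with newline\n"
    "first programming language again;\n"
)
readings = candidate_readings(source)
for kind, body in readings.items():
    print(kind, repr(body))
```

Both readings are structurally valid for this input, so the lexer has no basis for choosing between them from the character stream alone.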

Before going there, a possibly better design choice would be to defer any distinction between the first and second languages to the parser or, better, an analysis of the generated parse tree.
