I am working on an island grammar for parsing two programming languages (DSLs) in the same file. Statements of the second language always start with a special character (*), but they can take two forms: inline or multi-line.

In case of an inline statement, the line starts with * and terminates with a newline (\r? \n).

In the multi-line case, the statement starts with * and may span several lines, ending with a semicolon.

I am having difficulty using ANTLR4's lexer modes to accomplish this. Can somebody point me in the right direction?

My grammar is given below. For the following example, the parser reports two errors:

line 5:21 extraneous input '\n' expecting {<EOF>, ID, SWITCH_CHAR}
line 8:31 extraneous input '\n' expecting {<EOF>, ID, SWITCH_CHAR}

Example:

first programming language 
*example one second programming language inline statement ending with semicolon;
*example two another valid second programming language inline statement ending with newline
*example three second programming language may expand to the next line
until semicolon char;
*example four second programming language example may expand 
to a number of lines
too ending with semicolon char;
first programming language again

Lexer:

lexer grammar ComplexLanguageLexer;
/*** SEA ****/
ID: [a-z]+;
WS: [ \t\f]+ -> skip;
SWITCH_CHAR: '*' -> pushMode(inline_mode), pushMode(multiline_mode);
NEWLINE:  '\r'? '\n' -> skip;

/***ISLANDS****/
mode multiline_mode;
MULTILINE_SWITCH_CHAR: ';' -> popMode;  //seek until ';'
MULTILINE_ID: [a-z]+;
MULTILINE_WS: [ \t\f]+ -> skip;
MULTILINE_NEWLINE:  '\r'? '\n' -> skip; //just skip newlines in the multiline mode

mode inline_mode;
INLINE_NEWLINE:  '\r'? '\n' -> type(NEWLINE), popMode;
INLINE_SEMICOLONCHAR: ';' ; //just match semicolonchar
INLINE_ID: [a-z]+;
INLINE_WS: [ \t\f]+ -> skip;

Grammar:

parser grammar ComplexLanguageParser;
options { tokenVocab = ComplexLanguageLexer ; }

startRule:   programStatement+;

programStatement:
    word | inlineStatement| multilineStatement
;

word: ID;

inlineStatement:
    SWITCH_CHAR INLINE_ID+ INLINE_SEMICOLONCHAR? NEWLINE
;

multilineStatement:
    SWITCH_CHAR MULTILINE_ID+ MULTILINE_SWITCH_CHAR
;

Update

I've updated the lexer/parser grammar following @GRosenberg's instructions:

Lexer

lexer grammar ComplexLanguageLexer;

SWITCH_CHAR: STAR -> pushMode(second_mode) ;
ID1         : ID ;
WS1         : WS -> skip ;
NL1         : NL -> skip ;

fragment STAR : '*' ;
fragment ID   : [a-z]+ ;
fragment WS   : [ \t\f]+ ;
fragment NL   : '\r'? '\n' ;

mode second_mode ;
    TERM1 : ( WS | NL )* SEMI -> popMode ;
    TERM2 : WS* NL -> popMode ;
    ID2   : ID ;
    WS2   : WS+ -> skip ;
    NL2   : NL;
    SEMI : ';';

Grammar

parser grammar ComplexLanguageParser;
options { tokenVocab = ComplexLanguageLexer ; }

startRule:  programStatement+;
programStatement:   firstLanguageStatement | secondLanguageStatment ;
firstLanguageStatement:    word ;
secondLanguageStatment:    SWITCH_CHAR (inlineStatement| multilineStatement)     ;
word: ID1;
multilineStatement:    (ID2|NL2)+ TERM1;
inlineStatement:   ID2+ TERM2;

It works as intended for inline statements, but still does not work for all multi-line statements. What am I doing wrong here?

e.g.

first language            -> ok
*second language inline   -> ok 
*multi line;              -> ok
*multi line expands to 
 next line;                ->  token recognition error at ';'
*multi line
;                          -> ok
first language again       -> ok
1 Answer

The pushMode and popMode commands are implemented using a single stack. So, the rule

SWITCH_CHAR: '*' -> pushMode(inline_mode), pushMode(multiline_mode);

leaves multiline_mode on top of the stack, so the lexer evaluates the multiline_mode rules first. After a pop, the lexer evaluates the inline_mode rules, and only a second pop returns it to the default mode. That is unlikely to be what you want.
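To make the push/pop order concrete, here is a minimal sketch that models the mode stack with a plain Python list. It is only an illustration of stack behaviour, not ANTLR code; the helper names (push_mode, pop_mode, current_mode) are made up for this example.

```python
# Illustration only: a Python list standing in for the lexer's mode stack.
# The helper names below are hypothetical and are not part of the ANTLR API.
mode_stack = ["DEFAULT_MODE"]

def push_mode(mode):
    mode_stack.append(mode)

def pop_mode():
    return mode_stack.pop()

def current_mode():
    return mode_stack[-1]

# SWITCH_CHAR: '*' -> pushMode(inline_mode), pushMode(multiline_mode);
push_mode("inline_mode")
push_mode("multiline_mode")
after_star = current_mode()        # mode used right after '*'

pop_mode()                         # ';' in multiline_mode -> popMode
after_semicolon = current_mode()   # mode used after the ';'

pop_mode()                         # newline in inline_mode -> popMode
after_newline = current_mode()     # back in the default mode

print(after_star, after_semicolon, after_newline)
```

In this model the lexer is in multiline_mode right after the *, drops into inline_mode after the ;, and only returns to the default mode once a newline has been seen. So a statement terminated by ; still produces a trailing NEWLINE token, which would explain the "extraneous input '\n'" errors.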

It is better to implement a single lexer mode that correctly handles all of the second-language statements. The basic idea is:

SWITCH_CHAR : STAR -> pushMode(second_mode) ;

mode second_mode ;
    STMT1 : ( ID | WS | NL )+ SEMI -> popMode ;
    STMT2 : ( ID | WS )+ NL -> popMode ;

Untested, but should work provided ID does not include either STAR or SEMI.

Update

To expose ID to the parser, just break it out of the statement rules:

SWITCH_CHAR: STAR -> pushMode(second_mode) ;
ID1         : ID ;
WS1         : WS -> skip ;
NL1         : NL -> skip ;

fragment STAR : '*' ;
fragment ID   : [a-z]+ ;
fragment WS   : [ \t\f]+ ;
fragment NL   : '\r'? '\n' ;

mode second_mode ;
    TERM1 : ( WS | NL )* SEMI -> popMode ;
    TERM2 : WS* NL -> popMode ;
    ID2   : ID ;
    WS2   : WS+ -> skip ;

This, however, allows an ambiguity:

 *example two inline statement ending with newline
 first programming language again (including a semicolon)

If this is valid, there is simply insufficient structure to disambiguate without using native code (actions or semantic predicates written in the target language).
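The effect of this ambiguity can be seen in a rough Python emulation of the lexer. This is only a sketch under stated assumptions: it uses Python's re module rather than ANTLR, it mimics ANTLR's "longest match wins, ties go to the first rule" behaviour, it uses the `WS* NL` form of TERM2 from the question's update, and the names tokenize, DEFAULT_RULES and SECOND_RULES are invented for the example.

```python
import re

# Hypothetical emulation of the lexer rules above using Python regexes.
# Each entry is (token name, compiled pattern, action).
DEFAULT_RULES = [
    ("SWITCH_CHAR", re.compile(r"\*"), "push"),
    ("ID1", re.compile(r"[a-z]+"), None),
    ("WS1", re.compile(r"[ \t\f]+"), "skip"),
    ("NL1", re.compile(r"\r?\n"), "skip"),
]
SECOND_RULES = [
    ("TERM1", re.compile(r"(?:[ \t\f]|\r?\n)*;"), "pop"),  # ( WS | NL )* SEMI
    ("TERM2", re.compile(r"[ \t\f]*\r?\n"), "pop"),        # WS* NL
    ("ID2", re.compile(r"[a-z]+"), None),
    ("WS2", re.compile(r"[ \t\f]+"), "skip"),
]

def tokenize(text):
    modes = ["default"]
    pos, tokens = 0, []
    while pos < len(text):
        rules = SECOND_RULES if modes[-1] == "second" else DEFAULT_RULES
        best = None
        for name, pattern, action in rules:
            m = pattern.match(text, pos)
            # A strictly longer match wins, so ties go to the earlier rule.
            if m and m.end() > pos and (best is None or m.end() > best[1]):
                best = (name, m.end(), action)
        if best is None:
            # No rule matches: comparable to a token recognition error.
            tokens.append(("ERROR", text[pos]))
            pos += 1
            continue
        name, end, action = best
        if action != "skip":
            tokens.append((name, text[pos:end]))
        if action == "push":
            modes.append("second")
        elif action == "pop":
            modes.pop()
        pos = end
    return tokens

for token in tokenize("*multi line expands to \n next line;\n"):
    print(token)
```

In this emulation, TERM2 matches at the first line break and pops the mode, so "next" and "line" are lexed as ID1 and the final ; has no matching rule in the default mode. The same input reports "token recognition error at ';'" in the question. For "*multi line" followed directly by a line break and ;, TERM1 is the longer match, so that case succeeds.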

Before going there, a possibly better design choice would be to defer any distinction between the first and second languages to the parser or, better, an analysis of the generated parse tree.