3
votes

I am new here, looking for a replacement for my long-time friends flex & bison (I have been using lex/yacc for over 20 years). The main reason for the change is their, in my opinion, poor C++ support.

But while looking for a replacement, context sensitivity would be the main selection criterion.

Am I correct in understanding that ANTLR (be it 3 or 4; 4 does not yet have C++ support) does not have any automatic context sensitivity (as opposed to manual predicates)? And I am speaking of purely syntactic context, not semantic context like distinguishing functions from constructors in the C++ language, which is not the duty of a parser.

Take the following as an easy example:

Assume:

  • The rule that lines up stmts for the complete input is not included
  • The whitespace-skipping rule is not included
  • The syntax may not be 100% correct, but it is understandable for the purpose

    stmt: DEFINE SYSTEM ID '{' istmt '}' | DEFINE COMPONENT ID '{' istmt '}' ;

    DEFINE : 'define' ;

    SYSTEM : 'system' ;

    COMPONENT : 'component' ;

    ID : [A-Za-z][A-Za-z]* ;

    INT : [0-9]+ ;

Input:

define system define { ...

Both occurrences of "define" will be returned as DEFINE by a usual lexer, being the first longest match. The second "define" could be easily resolved to ID by the parser passing the set of tokens valid in the current context to the lexer, which would then select the first match among these (if any). I.e., when it comes to the ID position, the lexer identifies "define" first as DEFINE, but DEFINE is not currently valid, so it takes the next rule matching the same word, which would be ID, and returns that because it is in the valid set. If no match is found in the valid set, the usual first match would be returned instead for error handling/recovery.
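A rough sketch of this proposed scheme (all names are illustrative; no real lexer exposes exactly this API, and word boundaries are ignored for brevity): the lexer tries each rule in declaration order and returns the first match whose token type the parser currently accepts, falling back to the usual first match.

```typescript
type TokenType = "DEFINE" | "SYSTEM" | "COMPONENT" | "ID" | "INT";

// Token rules in declaration order, as in the grammar above.
const rules: Array<[TokenType, RegExp]> = [
  ["DEFINE", /^define/],
  ["SYSTEM", /^system/],
  ["COMPONENT", /^component/],
  ["ID", /^[A-Za-z]+/],
  ["INT", /^[0-9]+/],
];

function nextToken(
  input: string,
  valid: Set<TokenType>,
): { type: TokenType; text: string } | null {
  let fallback: { type: TokenType; text: string } | null = null;
  for (const [type, re] of rules) {
    const m = re.exec(input);
    if (!m) continue;
    if (fallback === null) fallback = { type, text: m[0] }; // the usual first match
    if (valid.has(type)) return { type, text: m[0] };       // first *valid* match wins
  }
  return fallback; // no valid match: return the usual token for error recovery
}

// After "define system", only ID is valid, so a second "define" lexes as ID:
const tok = nextToken("define {", new Set<TokenType>(["ID"]));
// tok is { type: "ID", text: "define" }
```

With `valid = {DEFINE, SYSTEM}` the same input would lex as DEFINE again, which is the fallback behavior the question describes.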

Of course there are more complex examples that cannot be resolved at the first level, but since every candidate always matches the same lexical word, looking ahead while building and reducing a tree of possible paths will in most cases eventually come down to a single branch identifying the correct token to use.

Context sensitivity could be enabled via an option, as well as at the level of individual lexer tokens. I could also elaborate on a more complex example if required.

2
AFAIK, ANTLR3's C++ support is limited (experimental). It was developed at a late stage of ANTLR3 (development of v4 had already started, AFAIK), and it was mostly developed by a single person. I'd be hesitant to switch from Lex/Bison to ANTLR3 solely for its C++ support. - Bart Kiers
But the C support is pretty stable and fast; we have been using it for a while now. It's easy to wrap it in a C++ class. - Mike Lischke
C++ was the main reason I was looking around for an alternative, while context sensitivity would be the main criterion for the actual selection. I have used flex/bison in some projects with C++, using code generated in C. I only had one project where using the "native" C++ parser was a huge advantage for the parsing process. I currently have a project where this would also be the case. - Gaston

2 Answers

4
votes

Bart identifies the likely controlling issue. Some additional comments:

In Antlr, the lexer (generally) runs to completion before the parser runs, so the parser is not in a position to influence lexical context.

If the overlap between the default token value space and the tokens specific to IDs is limited, then just include the overlapping tokens as part of the definition of an ID (a label is added as a convenience):

stmt: DEFINE SYSTEM id=( DEFINE | SYSTEM | COMPONENT | ID ) '{' istmt '}' 
    | DEFINE COMPONENT id=( DEFINE | SYSTEM | COMPONENT | ID ) '{' istmt '}' 
    ;

Where the overlap is larger or otherwise inconvenient, Antlr 4 provides lexical modes that allow explicit scoping of the allowed syntax at contextually defined points in the input stream. Antlr 3 can achieve basically the same functionality using lexer predicates.

The design decision to decouple the lexer and parser has existed since at least Antlr 2 and was likely made for separation-of-concerns and performance reasons. Where the lexer is driven by the parser, the lexer may have to re-lex portions of the input text many times in different contexts before the parser is able to satisfy a rule. The Antlr lexer tries, and largely succeeds, to process the entire input text in a single pass. Just different approaches taken with the same good intentions.

Update: Example

If SYSTEM and COMPONENT are adequate to signal the occurrence of an ID, and an ID is defined as a single word, the lexer becomes:

// mode default:
DEFINE : 'define' ;
SYSTEM : 'system' -> pushMode(ID_MODE) ;
COMPONENT : 'component' -> pushMode(ID_MODE) ;
LBRACE : '{' ;
RBRACE : '}' ;
WS : [ \t\r\n]+ -> channel(HIDDEN) ;

mode ID_MODE;
ID : [A-Za-z][A-Za-z]* -> popMode ;
IDWS : [ \t\r\n]+ -> channel(HIDDEN) ;

// or, if more than a single ID word were allowed:
mode ID_MODE;
ID : [A-Za-z][A-Za-z]* ;
IDLBRACE : '{' -> type(LBRACE), popMode ;
IDWS : [ \t\r\n]+ -> channel(HIDDEN) ;

The parser has no awareness of lexical modes, so the rule is simply:

stmt: DEFINE ( SYSTEM | COMPONENT ) ID LBRACE istmt RBRACE ;
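To make the mode mechanics concrete, here is a minimal simulation in plain TypeScript (not the generated ANTLR lexer; the token and mode names follow the grammar above, and whitespace handling is simplified to splitting on spaces). SYSTEM and COMPONENT push the id mode, in which any word lexes as ID; matching an ID pops back to the default mode.

```typescript
type Tok = { type: string; text: string };

function tokenize(input: string): Tok[] {
  const words = input.split(/\s+/).filter((w) => w.length > 0);
  const modes: string[] = ["default"]; // mode stack, as in an ANTLR lexer
  const toks: Tok[] = [];
  for (const w of words) {
    if (modes[modes.length - 1] === "id") {
      toks.push({ type: "ID", text: w }); // any word is an ID in this mode
      modes.pop();                        // popMode: back to the default mode
    } else if (w === "define") {
      toks.push({ type: "DEFINE", text: w });
    } else if (w === "system") {
      toks.push({ type: "SYSTEM", text: w });
      modes.push("id");                   // pushMode(id)
    } else if (w === "component") {
      toks.push({ type: "COMPONENT", text: w });
      modes.push("id");                   // pushMode(id)
    } else if (w === "{") {
      toks.push({ type: "LBRACE", text: w });
    } else if (w === "}") {
      toks.push({ type: "RBRACE", text: w });
    } else {
      toks.push({ type: "ID", text: w });
    }
  }
  return toks;
}

// The problematic input from the question: the second "define" lexes as ID,
// because the lexer is in the id mode when it reaches that word.
const types = tokenize("define system define {").map((t) => t.type);
// types is ["DEFINE", "SYSTEM", "ID", "LBRACE"]
```

This is exactly the effect the question asked for, achieved inside the lexer alone, without any feedback channel from the parser.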
0
votes

This is how I solved the ambiguity between keywords and tokens.

src/MyGrammar.g4

CONTEXT_KEYWORD_TYPE: 'type' ;
CONTEXT_KEYWORD_CONST: 'const' ;

// This token should come after all KEYWORDS
IDENTIFIER: [_a-zA-Z] [a-zA-Z0-9]* ;

// This token should come after all KEYWORDS and IDENTIFIER
RENDER_ALL_KEYWORDS: 'itWillBeReplaced';

value: ...;
typeDefinition: CONTEXT_KEYWORD_TYPE '=' '{' key=(IDENTIFIER | RENDER_ALL_KEYWORDS) ':' value '}';

In my first build step, I transform src/MyGrammar.g4 into src/gen/MyGrammar.g4 (replacing RENDER_ALL_KEYWORDS with the list of all CONTEXT_KEYWORD_... tokens, separated by |):

src/gen/MyGrammar.g4

...
typeDefinition: CONTEXT_KEYWORD_TYPE '=' '{' key=(IDENTIFIER | CONTEXT_KEYWORD_TYPE | CONTEXT_KEYWORD_CONST) ':' value '}';

Then I compile `src/gen/MyGrammar.g4` as my source grammar file.
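The substitution step could be sketched like this (the `renderKeywords` helper and its regexes are illustrative, not the actual build script; only the file contents are processed here, not the file paths):

```typescript
// Collect every CONTEXT_KEYWORD_* token name defined in the grammar and
// splice them into each usage of the RENDER_ALL_KEYWORDS placeholder.
function renderKeywords(grammar: string): string {
  const names = [...grammar.matchAll(/^(CONTEXT_KEYWORD_\w+)\s*:/gm)].map(
    (m) => m[1],
  );
  // Replace only usages, not the token definition line itself
  // (the negative lookahead skips "RENDER_ALL_KEYWORDS:").
  return grammar.replace(/\bRENDER_ALL_KEYWORDS\b(?!\s*:)/g, names.join(" | "));
}

// A miniature src/MyGrammar.g4 as in the answer above:
const src = [
  "CONTEXT_KEYWORD_TYPE: 'type' ;",
  "CONTEXT_KEYWORD_CONST: 'const' ;",
  "IDENTIFIER: [_a-zA-Z] [a-zA-Z0-9]* ;",
  "RENDER_ALL_KEYWORDS: 'itWillBeReplaced';",
  "typeDefinition: CONTEXT_KEYWORD_TYPE '=' '{' key=(IDENTIFIER | RENDER_ALL_KEYWORDS) ':' value '}';",
].join("\n");

const gen = renderKeywords(src);
// The rule now reads key=(IDENTIFIER | CONTEXT_KEYWORD_TYPE | CONTEXT_KEYWORD_CONST)
```

Adding a new CONTEXT_KEYWORD_... token then requires no change to any parser rule: the next build regenerates the alternative list automatically.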

In TypeScript, the MyLangTreeListener context has the property `_key`:

export class MyLangRootListener implements MyLangListener {
  enterTypeDefinition(ctx: TypeDefinitionContext): void {
    ctx._key;
  }
}

This way I have context keywords that don't conflict with the IDENTIFIER token. I just have to add RENDER_ALL_KEYWORDS and use the syntax myTokenName=(IDENTIFIER | RENDER_ALL_KEYWORDS).

There is no need to update everything when adding a new context-specific keyword.