3
votes

I am new here, looking for a replacement for my long-time friends flex & bison (I have been using lex/yacc for over 20 years). The main reason for the change is their, in my opinion, poor C++ support.

But while looking for a replacement, context sensitivity would be the main selection criterion.

Am I correct in understanding that ANTLR (be it 3 or 4; 4 does not yet have C++ support) does not have any automatic context sensitivity (as opposed to manual predicates)? And I am speaking of purely syntactic context, not semantic context like distinguishing functions from constructors in the C++ language, which is not the duty of a parser.

Take the following as an easy example:

Assume:

  • The rule that lines up stmts for the complete input is not included
  • The whitespace-skipping rule is not included
  • The syntax may not be 100% correct, but it is understandable for the purpose

    stmt: DEFINE SYSTEM ID '{' istmt '}' | DEFINE COMPONENT ID '{' istmt '}' ;

    DEFINE : 'define' ;

    SYSTEM : 'system' ;

    COMPONENT : 'component' ;

    ID : [A-Za-z][A-Za-z]* ;

    INT : [0-9]+ ;

Input:

define system define { ...

Both occurrences of "define" will be returned as DEFINE by a usual lexer, being the first longest match. The second "define" could be easily resolved to ID by the parser passing the set of tokens valid in the current context to the lexer, which would then select the first match among these (if any). I.e., when it comes to the ID position, the lexer identifies "define" first as DEFINE, but DEFINE is not currently valid, so it takes the next rule matching the same word, which would be ID, and returns that because it is in the valid set. If no match is found in the valid set, the usual first match would be returned instead for error handling/recovery.
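A rough sketch of this proposed scheme (all names are illustrative; no real lexer exposes exactly this API, and word boundaries are ignored for brevity): the lexer tries each rule in declaration order and returns the first match whose token type the parser currently accepts, falling back to the usual first match.

```typescript
type TokenType = "DEFINE" | "SYSTEM" | "COMPONENT" | "ID" | "INT";

// Token rules in declaration order, as in the grammar above.
const rules: Array<[TokenType, RegExp]> = [
  ["DEFINE", /^define/],
  ["SYSTEM", /^system/],
  ["COMPONENT", /^component/],
  ["ID", /^[A-Za-z]+/],
  ["INT", /^[0-9]+/],
];

function nextToken(
  input: string,
  valid: Set<TokenType>,
): { type: TokenType; text: string } | null {
  let fallback: { type: TokenType; text: string } | null = null;
  for (const [type, re] of rules) {
    const m = re.exec(input);
    if (!m) continue;
    if (fallback === null) fallback = { type, text: m[0] }; // the usual first match
    if (valid.has(type)) return { type, text: m[0] };       // first *valid* match wins
  }
  return fallback; // no valid match: return the usual token for error recovery
}

// After "define system", only ID is valid, so a second "define" lexes as ID:
const tok = nextToken("define {", new Set<TokenType>(["ID"]));
// tok is { type: "ID", text: "define" }
```

With `valid = {DEFINE, SYSTEM}` the same input would lex as DEFINE again, which is the fallback behavior the question describes.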

Of course there are more complex examples that cannot be resolved at the first level, but since every candidate always matches the same lexical word, looking ahead while building and reducing a tree of possible paths will in most cases eventually come down to a single branch identifying the correct token to use.

Context sensitivity could be enabled via an option, as well as at the level of individual lexer tokens. I could also elaborate on a more complex example if required.

2
AFAIK, ANTLR3's C++ support is limited (experimental). It was developed at a late stage of ANTLR3 (development of v4 had already started, AFAIK), and it was mostly developed by a single person. I'd be hesitant to switch from Lex/Bison to ANTLR3 solely for its C++ support. - Bart Kiers
But the C support is pretty stable and fast; we have been using it for a while now. It's easy to wrap it in a C++ class. - Mike Lischke
C++ was the main reason I was looking around for an alternative, while context sensitivity would be the main criterion for the actual selection. I have used flex/bison in some projects with C++, using code generated in C. I only had one project where using the "native" C++ parser was a huge advantage for the parsing process. I currently have a project where this would also be the case. - Gaston

2 Answers

4
votes

Bart identifies the likely controlling issue. Some additional comments:

In Antlr, the lexer (generally) runs to completion before the parser runs, so the parser is not in a position to influence lexical context.

If the overlap between the default token value space and the tokens specific to IDs is limited, then just include the overlapping tokens as part of the definition of an ID (a label is added as a convenience):

stmt: DEFINE SYSTEM id=( DEFINE | SYSTEM | COMPONENT | ID ) '{' istmt '}' 
    | DEFINE COMPONENT id=( DEFINE | SYSTEM | COMPONENT | ID ) '{' istmt '}' 
    ;

Where the overlap is larger or otherwise inconvenient, Antlr 4 provides lexical modes that allow explicit scoping of the allowed syntax at contextually defined points in the input stream. Antlr 3 can achieve basically the same functionality using lexer predicates.

The design decision to decouple the lexer and parser has existed since at least Antlr 2 and was likely made for separation-of-concerns and performance reasons. Where the lexer is driven by the parser, the lexer may have to re-lex portions of the input text many times in different contexts before the parser is able to satisfy a rule. The Antlr lexer tries, and largely succeeds, to process the entire input text in a single pass. Just different approaches taken with the same good intentions.

Update: Example

If SYSTEM and COMPONENT are adequate to signal the occurrence of an ID, and an ID is defined as a single word, the lexer becomes:

// mode default:
DEFINE : 'define' ;
SYSTEM : 'system' -> pushMode(ID_MODE) ;
COMPONENT : 'component' -> pushMode(ID_MODE) ;
LBRACE : '{' ;
RBRACE : '}' ;
WS : [ \t\r\n]+ -> channel(HIDDEN) ;

mode ID_MODE;
ID : [A-Za-z][A-Za-z]* -> popMode ;
IDWS : [ \t\r\n]+ -> channel(HIDDEN) ;

// or, if more than a single ID word were allowed:
mode ID_MODE;
ID : [A-Za-z][A-Za-z]* ;
IDLBRACE : '{' -> type(LBRACE), popMode ;
IDWS : [ \t\r\n]+ -> channel(HIDDEN) ;

The parser has no awareness of lexical modes, so the rule is simply:

stmt: DEFINE ( SYSTEM | COMPONENT ) ID LBRACE istmt RBRACE ;
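To make the mode mechanics concrete, here is a minimal simulation in plain TypeScript (not the generated ANTLR lexer; the token and mode names follow the grammar above, and whitespace handling is simplified to splitting on spaces). SYSTEM and COMPONENT push the id mode, in which any word lexes as ID; matching an ID pops back to the default mode.

```typescript
type Tok = { type: string; text: string };

function tokenize(input: string): Tok[] {
  const words = input.split(/\s+/).filter((w) => w.length > 0);
  const modes: string[] = ["default"]; // mode stack, as in an ANTLR lexer
  const toks: Tok[] = [];
  for (const w of words) {
    if (modes[modes.length - 1] === "id") {
      toks.push({ type: "ID", text: w }); // any word is an ID in this mode
      modes.pop();                        // popMode: back to the default mode
    } else if (w === "define") {
      toks.push({ type: "DEFINE", text: w });
    } else if (w === "system") {
      toks.push({ type: "SYSTEM", text: w });
      modes.push("id");                   // pushMode(id)
    } else if (w === "component") {
      toks.push({ type: "COMPONENT", text: w });
      modes.push("id");                   // pushMode(id)
    } else if (w === "{") {
      toks.push({ type: "LBRACE", text: w });
    } else if (w === "}") {
      toks.push({ type: "RBRACE", text: w });
    } else {
      toks.push({ type: "ID", text: w });
    }
  }
  return toks;
}

// The problematic input from the question: the second "define" lexes as ID,
// because the lexer is in the id mode when it reaches that word.
const types = tokenize("define system define {").map((t) => t.type);
// types is ["DEFINE", "SYSTEM", "ID", "LBRACE"]
```

This is exactly the effect the question asked for, achieved inside the lexer alone, without any feedback channel from the parser.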
0
votes

This is how I solved the ambiguity between keywords and tokens.

src/MyGrammar.g4

CONTEXT_KEYWORD_TYPE: 'type' ;
CONTEXT_KEYWORD_CONST: 'const' ;

// This token should come after all KEYWORDS
IDENTIFIER: [_a-zA-Z] [a-zA-Z0-9]* ;

// This token should come after all KEYWORDS and IDENTIFIER
RENDER_ALL_KEYWORDS: 'itWillBeReplaced';

value: ...;
typeDefinition: CONTEXT_KEYWORD_TYPE '=' '{' key=(IDENTIFIER | RENDER_ALL_KEYWORDS) ':' value '}';

In my first build step, I transform src/MyGrammar.g4 into src/gen/MyGrammar.g4 (replacing RENDER_ALL_KEYWORDS with the list of all CONTEXT_KEYWORD_... tokens, separated by |):

src/gen/MyGrammar.g4

...
typeDefinition: CONTEXT_KEYWORD_TYPE '=' '{' key=(IDENTIFIER | CONTEXT_KEYWORD_TYPE | CONTEXT_KEYWORD_CONST) ':' value '}';

Then I compile `src/gen/MyGrammar.g4` as my source grammar file.
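The substitution step could be sketched like this (the `renderKeywords` helper and its regexes are illustrative, not the actual build script; only the file contents are processed here, not the file paths):

```typescript
// Collect every CONTEXT_KEYWORD_* token name defined in the grammar and
// splice them into each usage of the RENDER_ALL_KEYWORDS placeholder.
function renderKeywords(grammar: string): string {
  const names = [...grammar.matchAll(/^(CONTEXT_KEYWORD_\w+)\s*:/gm)].map(
    (m) => m[1],
  );
  // Replace only usages, not the token definition line itself
  // (the negative lookahead skips "RENDER_ALL_KEYWORDS:").
  return grammar.replace(/\bRENDER_ALL_KEYWORDS\b(?!\s*:)/g, names.join(" | "));
}

// A miniature src/MyGrammar.g4 as in the answer above:
const src = [
  "CONTEXT_KEYWORD_TYPE: 'type' ;",
  "CONTEXT_KEYWORD_CONST: 'const' ;",
  "IDENTIFIER: [_a-zA-Z] [a-zA-Z0-9]* ;",
  "RENDER_ALL_KEYWORDS: 'itWillBeReplaced';",
  "typeDefinition: CONTEXT_KEYWORD_TYPE '=' '{' key=(IDENTIFIER | RENDER_ALL_KEYWORDS) ':' value '}';",
].join("\n");

const gen = renderKeywords(src);
// The rule now reads key=(IDENTIFIER | CONTEXT_KEYWORD_TYPE | CONTEXT_KEYWORD_CONST)
```

Adding a new CONTEXT_KEYWORD_... token then requires no change to any parser rule: the next build regenerates the alternative list automatically.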

In TypeScript, the MyLangTreeListener context has the property `_key`:

export class MyLangRootListener implements MyLangListener {
  enterTypeDefinition(ctx: TypeDefinitionContext): void {
    ctx._key;
  }
}

This way I have context keywords that don't conflict with the IDENTIFIER token. I just have to add RENDER_ALL_KEYWORDS and use the syntax myTokenName=(IDENTIFIER | RENDER_ALL_KEYWORDS).

There is no need to update everything when adding a new context-specific keyword.