ANTLR4 : clean grammar and tree with keywords (aliases ?)

Question

I am looking for a solution to a simple problem.

The example :

SELECT date, date(date)
FROM date;

This is a rather stupid example where a table, its column, and a function all have the name "date".

The snippet of my grammar (very simplified) :

simple_select
    : SELECT selected_element (',' selected_element)* FROM from_element ';'
    ;

selected_element
    : function
    | REGULAR_WORD
    ;

function
    : REGULAR_WORD '(' function_argument ')'
    ;

function_argument
    : REGULAR_WORD
    ;

from_element
    : REGULAR_WORD
    ;


DATE:     D A T E;
FROM:     F R O M;
SELECT:   S E L E C T;

REGULAR_WORD
    : (SIMPLE_LETTER) (SIMPLE_LETTER | '0'..'9')*
    ;

fragment SIMPLE_LETTER
    : 'a'..'z'
    | 'A'..'Z'
    ;

DATE is a keyword (it is used somewhere else in the grammar). If I want it to be recognised by my grammar as a normal word, here are my solutions :

1) I add it everywhere I used REGULAR_WORD, next to it. Example :

selected_element
    : function
    | REGULAR_WORD
    | DATE
    ;

=> I don't want this solution. I don't have only "DATE" as a keyword, and I have many rules using REGULAR_WORD, so I would need to add a list of many (50+) keywords like DATE to many (20+) parser rules : it would be absolutely ugly.

PROS: makes a clean tree

CONS: makes a dirty grammar

2) I use a parser rule in between to get all those keywords, and then, I replace every occurrence of REGULAR_WORD by that parser rule. Example :

word
    : REGULAR_WORD
    | DATE
    ;

selected_element
    : function
    | word
    ;

=> I do not want this solution either, as it adds one more parser rule to the tree and pollutes the information: I do not want to know that "date" is a word, I want to know that it is a selected_element, a function, a function_argument or a from_element.

PROS: makes a clean grammar

CONS: makes a dirty tree

Either way, I have a dirty tree or a dirty grammar. Isn't there a way to have both clean ?

I looked for aliases, or an equivalent of fragments for parser rules, but ANTLR4 does not seem to have any?

Thank you, have a nice day !

I'm afraid this problem has no solution. It is called "context sensitive lexer". It is the price you pay when you want a backward-compatible grammar (like Oracle SQL, for example, which distinguishes "keywords" from "reserved words") - ibre5041
@ibre5041 : Thank you for the answer. That is what I feared. What do you mean, though, by "It is called "context sensitive lexer""? Do you mean that a grammar solving my issue would need a "context sensitive lexer"? - Kronos

rici · Accepted Answer · 2019-09-06T15:53:14

There are four different grammars for SQL dialects in the ANTLR4 grammar repository and all four of them use your second strategy. So there seems to be a consensus among ANTLR4 SQL grammar writers. I don't believe there is a better solution given the design of the ANTLR4 lexer.

As you say, that leads to a bit of noise in the full parse tree, but the relevant non-terminal (function, selected_element, etc.) is certainly present and it does not seem to me to be very difficult to collapse the unit productions out of the parse tree.
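As a rough illustration of collapsing such wrapper rules, here is a minimal sketch in Python. It does not use the ANTLR runtime: `Node` is a hypothetical stand-in for a parse-tree node, and `TRANSPARENT_RULES` assumes the wrapper rule is called `word`, as in the question's second strategy. The same idea should carry over to a real parse tree, but the details would differ.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Node:
    """Hypothetical stand-in for a parse-tree node: a rule name plus children."""
    rule: str
    children: List[Union["Node", str]] = field(default_factory=list)


# Rules that only group token alternatives (assumed name, from strategy 2).
TRANSPARENT_RULES = {"word"}


def collapse(node):
    """Return a copy of the tree with the transparent wrapper rules spliced out."""
    if isinstance(node, str):  # a token: keep it unchanged
        return node
    new_children = []
    for child in node.children:
        child = collapse(child)
        if isinstance(child, Node) and child.rule in TRANSPARENT_RULES:
            new_children.extend(child.children)  # hoist the tokens one level up
        else:
            new_children.append(child)
    return Node(node.rule, new_children)


# Hand-built tree for: SELECT date, date(date) FROM date;
tree = Node("simple_select", [
    "SELECT",
    Node("selected_element", [Node("word", ["date"])]),
    ",",
    Node("selected_element", [
        Node("function", [
            Node("word", ["date"]),
            "(",
            Node("function_argument", [Node("word", ["date"])]),
            ")",
        ]),
    ]),
    "FROM",
    Node("from_element", [Node("word", ["date"])]),
    ";",
])

clean = collapse(tree)
```

After the call, `clean` keeps `selected_element`, `function`, `function_argument` and `from_element`, each holding the `date` token directly, and no `word` node remains.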

As I understand it, when ANTLR4 was being designed, a decision was made to only automatically produce full parse trees, because the design of condensed ("abstract") syntax trees is too idiosyncratic to fit into a grammar DSL. So if you find an AST more convenient, you have the responsibility to generate one yourself. That's generally straightforward, although it involves a lot of boilerplate.

Other parser generators do have mechanisms which can handle "semireserved keywords". In particular, the Lemon parser generator, which is part of the SQLite project, includes a %fallback declaration which allows you to specify that one or more tokens should be automatically reclassified in a context in which no grammar rule allows them to be used. Unfortunately, Lemon does not generate Java parsers.
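To make the fallback idea concrete, here is a small Python sketch of the reclassification decision. It is not Lemon's implementation: the `FALLBACK` table, the `reclassify` function and the `expected` parameter are all hypothetical, and the sketch assumes the parser can report which token types it would accept in its current state.

```python
# Hypothetical table: keyword token types that may fall back to a plain word.
FALLBACK = {"DATE": "REGULAR_WORD"}


def reclassify(token_type, expected):
    """Pick the token type the parser should see.

    `expected` is the set of token types valid in the current parser state
    (assumed to be supplied by the parser; how to obtain it depends on the tool).
    """
    if token_type in expected:
        return token_type        # the keyword itself is legal here
    fallback = FALLBACK.get(token_type)
    if fallback in expected:
        return fallback          # no rule accepts the keyword: treat it as a word
    return token_type            # nothing fits: let the parser report the error


as_word = reclassify("DATE", {"REGULAR_WORD"})
as_keyword = reclassify("DATE", {"DATE", "REGULAR_WORD"})
no_fallback = reclassify("FROM", {"REGULAR_WORD"})
```

With this table, `DATE` becomes a `REGULAR_WORD` only where the keyword is not acceptable, stays `DATE` where it is, and `FROM` is never reclassified because it has no fallback entry.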

Another similar option would be to use a parser generator which supports "scannerless" parsing. Such parsers typically use algorithms like Earley/GLL/GLR, capable of parsing arbitrary CFGs, to get around the need for more lookahead than can conveniently be supported in fixed-lookahead algorithms such as LALR(1).
