1
votes

I'm working on an SQL grammar in ANTLR which allows quoted identifiers (table names, field names, etc), as well as quoted literal strings.

The problem is that this grammar seems to always match quoted inputs as "QUOTED_LITERAL", and never as IDs wrapped in quotes.

Here are my results:

  • input: 'blahblah' result: string_literal as expected.
  • input: field1 restul: column_name as expected
  • input: table.field1 result: column_spec as expected
  • input: 'table'.'field1' result: string_literal, MissingTokenException

Below is my simplified grammar for the expression portion of the SQL grammar, if anybody can help identify what is needed to match quoted rules other than the quoted literal, thanks.

grammar test;

 expression
    :
    simpleExpression  EOF!
    ;

simpleExpression
    :
     column_spec
    | literal_value 
;
column_spec
    :
    (table_name '.')? column_name
    | ('\''table_name '\'''.')? '\'' column_name '\''
    | ('\"'table_name '\"' '.')? '\"' column_name '\"'
    ;

string_literal:     QUOTED_LITERAL ;
boolean_literal:    'TRUE' | 'FALSE' ;
literal_value :
    ( 
    string_literal  
    | boolean_literal 
    )   
        ;

table_name :ID;
column_name  :ID;

QUOTED_LITERAL:
    (  '\'' 
        ( ('\\' '\\') | ('\'' '\'') | ('\\' '\'') | ~('\'') )* 
    '\''  )
    |
    (  '\"'
        ( ('\\' '\\') | ('\"' '\"') | ('\\' '\"') | ~('\"') )* 
    '\"'  ) 
;

ID
    :   
    ( 'A'..'Z' | 'a'..'z' ) ( 'A'..'Z' | 'a'..'z' | '_'  | '0'..'9'| '::' )*
    ;


WHITE_SPACE : ( ' '|'\r'|'\t'|'\n' ) {$channel=HIDDEN;} ;
3

3 Answers

1
votes

In case anybody is interested, I removed a little bit of the flexibility from the quoted literal strings. Literal strings can only be quoted by single quotes, and identifiers can be optionally quoted by double quotes. As long as the literal quote and the identifier quote is well defined and they don't overlap, the grammar is trivial.

This policy makes the grammar much cleaner, and doesn't remove the ability to quote identifiers. I make use of the JDBC method getIdentifierQuote to report which quote can be used to wrap identifiers.

0
votes

This is your classical shift/reduce conflict. (Except that ANTLR does not shift or reduce; since it is not a stack automaton.)

You have the following problem:

When you are in the simpleExpression state you need to decide what branch to take with one token lookahead. In the case of ANTLR, since no difference is done between lexer and parser the one token is a single character. (You should see a warning from ANTLR about the conflict.)

It gets even better, what is the difference between "Bob Dillan" and "table1"? From the parsers point of view, none. So how do you expect to make a difference between:

('\"'table_name '\"' '.')? '\"' column_name '\"'

and

(  '\"'
    ( ('\\' '\\') | ('\"' '\"') | ('\\' '\"') | ~('\"') )* 
'\"'  )

I strongly suggest to rewrite the simpleExpression rule to:

simpleExpression: 
    IDENTIFIER |
    IDENTIFIER . IDENTIFIER |
    QUOTED_LITERAL |
    QUOTED_LITERAL . QUOTED_LITERAL |
    boolean_literal;

And then decide in the action code of simpleExpression what to do. Especially since I am quite sure that you can reference a table with a quoted name; never the less "users" and "Bod Dillan" are syntactically equal.

It also depends on the grater grammar, you may also be able to resolve the amiability on a higher level.

0
votes

The antlr lexer is greedy, in that when there are two possible token matches, it will match the longest possible one.

When the lexer sees 'some_id', it can match the first quote as just a quote, or a quoted literal. The literal is longer, so that matches.

As a side note, you generally do not want lexer rules that can match nothing (like ID) or to uses string constants in the parser rules, but only reference token names.

What you want to do is something like this.

QUOTE: '\'';
ID: ('a'..'z' | 'A'..'Z')+; // Must have at least one character
QUOTED_LITERAL: QUOTE ( (ID QUOTE) => { $type=QUOTE; } ) | .* QUOTE;

id: ID | QUOTE ID QUOTE;
quoted_literal: QUOTED_LITERAL | QUOTE ID QUOTE;

If the lexer sees something that looks like a quoted id, it cannot tell which to use, so it breaks it up into smaller tokens. In your parser, you use id where you expect a possibly quoted ID, and quoted_literal where you expect a QUOTED_LITERAL.

The syntactical predicate in QUOTED_LITERAL prevents it from matching the full quote when the input is ambiguous.

Looking that this, it will fail to correctly parse lines like

'tag' text 'second'

as ' text ' will be parsed as a QUOTED_LITERAL. If that is a valid input, then you would need something like

fragment QUOTED_ID;
QUOTED_LITERAL: QUOTE ( ID {$type=QUOTED_ID} | .* ) QUOTE;
id: ID | QUOTED_ID;
quoted_literal: QUOTED_LITERAL | QUOTED_ID;

(My example does not cover all the cases in your input, but extending it should be obvious. You also probably need some actions to either generate the correct tokens in your AST or add/remove quotes from the text, depending one what you do after you parse.)