5
votes

I'm trying to implement an expression/formula language in ANTLR4 and having a problem with whitespace handling. In most cases I don't care about whitespace, so I have the "standard" lexer rule to send it to the HIDDEN channel, i.e.

// Whitespace
WS
    :   ( ' ' | '\t' |'\r' | '\n' ) -> channel(HIDDEN)
    ;

However I have one operator which doesn't allow whitespace either before or after, and I can't see how to handle the situation without changing the WS lexer rule to leave the whitespace in the default channel and having explicit WS? terms in all of my other parser rules (there are quite a lot of them).

As a simplified example, I created the following grammar for an imaginary predicate language:

grammar Logik;

/*
 * Parser Rules
 */

ruleExpression
    :   orExpression
    ;

orExpression
    :   andExpression ( 'OR' andExpression)*
    ;

andExpression
    :   primaryExpression ( 'AND' primaryExpression)*
    ;

primaryExpression
    :   variableExpression
    |   '(' ruleExpression ')'
    ;

variableExpression
    :   IDENTIFIER ( '.' IDENTIFIER )*
    ;

/*
 * Lexer Rules
 */

IDENTIFIER
    :   LETTER LETTERORDIGIT*
    ;

fragment LETTER : [a-zA-Z_];
fragment LETTERORDIGIT : [a-zA-Z0-9_];

// Whitespace
WS
    :   ( ' ' | '\t' |'\r' | '\n' ) -> channel(HIDDEN)
    ;

As it stands, this parses both A OR B AND C.D and A OR B AND C. D successfully. What I need is for the . operator to not allow whitespace, so that the second expression is rejected.


2 Answers

4
votes

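Tokens sent to the HIDDEN channel are still kept in the token stream, so a semantic predicate in the parser rule can look at the token immediately before the current one and check that it is not whitespace: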

variableExpression
  :   IDENTIFIER ( '.' {_input.get(_input.index() -1).getType() != WS}? IDENTIFIER )*
  ;

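With this, A OR B AND C.D is OK and A OR B AND C. D will print an error. Note that the predicate only inspects the token just before the second IDENTIFIER, so it rejects whitespace after the '.' but not before it.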
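If whitespace before the '.' has to be rejected as well, the same check could be repeated in front of the '.' token. This is only a sketch, assuming the Java target and the WS rule from the question; the exact error message may differ depending on which of the two predicates fails.

```antlr
variableExpression
    :   IDENTIFIER
        (   // current token is '.': the token before it must not be WS
            {_input.get(_input.index() - 1).getType() != WS}? '.'
            // current token is the IDENTIFIER: the token before it must not be WS
            {_input.get(_input.index() - 1).getType() != WS}? IDENTIFIER
        )*
    ;
```

Both predicates read the raw token buffer by index, which includes hidden-channel tokens, so a WS token on either side of the '.' is visible to them even though the parser itself never sees it.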

2
votes

You can use a lexer predicate to look ahead (and behind) in the character stream, and create a dedicated token for '.'. In your example, it looks like this:

grammar Logik;

/*
 * Parser Rules
 */

ruleExpression
    :   orExpression
    ;

orExpression
    :   andExpression ( 'OR' andExpression)*
    ;

andExpression
    :   primaryExpression ( 'AND' primaryExpression)*
    ;

primaryExpression
    :   variableExpression
    |   '(' ruleExpression ')'
    ;

variableExpression
    :   IDENTIFIER ( POINT IDENTIFIER )*
    ;

/*
 * Lexer Rules
 */

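// The predicate runs before '.' is consumed: LA(-1) is the character before
// the '.', LA(2) is the character after it. Neither may be whitespace.
POINT : {!Character.isWhitespace(_input.LA(-1)) && !Character.isWhitespace(_input.LA(2))}? '.';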
IDENTIFIER
    :   LETTER LETTERORDIGIT*
    ;

fragment LETTER : [a-zA-Z_];
fragment LETTERORDIGIT : [a-zA-Z0-9_];

// Whitespace
WS
    :   ( ' ' | '\t' |'\r' | '\n' ) -> channel(HIDDEN)
    ;

This way, A OR B AND C.D is OK, while A OR B AND C. D (and likewise A OR B AND C .D) gives an error such as: token recognition error at: '.' ...

NOTE

There is probably a way to do this with the HIDDEN channel and semantic predicates in the parser rules instead. But if the same constraint applies in many places, you will have to write the predicate in each parser rule where the constraint should be enforced.
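One way to limit that repetition might be to move the check into a helper method in the parser's members section and call it from each predicate. The sketch below assumes the Java target; noWsBefore is a hypothetical name, not part of ANTLR, and WS is the token type from the grammar above.

```antlr
@parser::members {
    // Hypothetical helper: true when the token directly before the current
    // one in the raw token buffer (hidden channel included) is not WS.
    boolean noWsBefore() {
        int i = _input.index();
        return i == 0 || _input.get(i - 1).getType() != WS;
    }
}

variableExpression
    :   IDENTIFIER ( {noWsBefore()}? '.' {noWsBefore()}? IDENTIFIER )*
    ;
```

Each rule that needs the constraint still has to call the helper, but the token-stream logic then lives in one place.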