5
votes

I am having some troubles in handling whitespace. In the following excerpt of a grammar, I set up the lexer so that the parser skips whitespace:

ENTITY_VAR
    : 'user'
    | 'resource'
    ;

INT : DIGIT+ | '-' DIGIT+ ;
ID : LETTER (LETTER | DIGIT | SPECIAL)* ;
ENTITY_ID : '__' ENTITY_VAR ('_w_' ID)?;

NEWLINE : '\r'? '\n';

WS : [ \t\r\n]+ -> skip; // skip spaces, tabs, newlines

fragment LETTER : [a-zA-Z];
fragment DIGIT : [0-9];
fragment SPECIAL : ('_' | '#' );

The problem is that I would like to match variable names of the form ENTITY_ID such that the matched string does not contain any whitespace. Writing it as a lexer rule, as I did here, would be sufficient, but I'd rather do it with a parser rule, because I want direct access to the two tokens ENTITY_VAR and ID individually from my code, instead of getting them squeezed back together into a single ENTITY_ID token.

Any ideas? Basically, any solution that lets me access ENTITY_VAR and ID directly would suit me, whether it keeps ENTITY_ID as a lexer rule or moves it to the parser.

3
Perhaps lexical modes can help? Once you stumble upon '__', you switch to a mode where you don't skip spaces? – Bart Kiers
Thanks for your suggestion. So, if I understand correctly, I'd write a parser rule entityVar such that when matching against '__' it switches to a mode where the WS lexer rule is disabled? – Riccardo T.
Oops, I meant entityId, not entityVar. – Riccardo T.
Lexer modes are not dependent on parser rules. Whenever the lexer matches '__', it would switch modes, regardless of whether there is a parser rule that actually uses the '__' token. – Bart Kiers
Could you give some sample input and what you'd like to have? – Onur

3 Answers

2
votes

There are several approaches I can think of (not in a special order):

  1. Emit several tokens from the rule ENTITY_ID; see ANTLR4: How to inject tokens for inspiration
  2. Allow whitespace in the parser and check afterwards
  3. Use the single token and split in code
  4. Use the single token and modify the token stream before passing it to the parser. That is, lex, split the ENTITY_ID tokens into several other tokens, then pass this stream to the parser (see the sketch after this list)
  5. Don't skip whitespace; when dealing with these "extra tokens", check whether they occur within an ENTITY_ID part (=> error) or not (=> ignore them).
  6. Don't skip whitespace and add "WS*" everywhere in your grammar where whitespace is allowed (ok if the grammar is not too large).
  7. Insert predicates in the parser rule that check whether there is whitespace between the tokens.
  8. Create a "trap" rule like this:

    INVALID_ENTITY_ID : '__' WS+ ENTITY_VAR WS? ('_w_' WS? ID)?
                      | '__' WS? ENTITY_VAR WS+ ('_w_' WS? ID)?
                      | '__' WS? ENTITY_VAR WS? ('_w_' WS+ ID)
                      ;
    

    This will catch invalid ENTITY_IDs, since the trap rule matches a longer input than the individual parts would on their own, so the lexer's longest-match rule prefers it; without whitespace, the parts still come out as individual tokens.
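
For option 4, here is a rough, untested sketch of the token-stream rewrite. The class name MyGrammarLexer and its token-type constants are placeholders for whatever ANTLR generates from your grammar, and the freshly created tokens carry no position information:

    import java.util.ArrayList;
    import java.util.List;

    import org.antlr.v4.runtime.CommonToken;
    import org.antlr.v4.runtime.CommonTokenStream;
    import org.antlr.v4.runtime.ListTokenSource;
    import org.antlr.v4.runtime.Token;

    public class SplitEntityIds {

        // Lex everything, split each ENTITY_ID token into ENTITY_VAR (+ ID),
        // and hand the rewritten stream to the parser.
        static CommonTokenStream rewrite(MyGrammarLexer lexer) {
            List<Token> out = new ArrayList<>();
            for (Token t : lexer.getAllTokens()) {
                if (t.getType() == MyGrammarLexer.ENTITY_ID) {
                    String text = t.getText().substring(2);   // drop the leading "__"
                    int sep = text.indexOf("_w_");
                    if (sep < 0) {
                        out.add(new CommonToken(MyGrammarLexer.ENTITY_VAR, text));
                    } else {
                        out.add(new CommonToken(MyGrammarLexer.ENTITY_VAR, text.substring(0, sep)));
                        out.add(new CommonToken(MyGrammarLexer.ID, text.substring(sep + 3)));
                    }
                } else {
                    out.add(t);
                }
            }
            return new CommonTokenStream(new ListTokenSource(out));
        }
    }

The parser would then be constructed from the returned stream instead of directly from the lexer.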

I'd go with 2, provided it doesn't alter the parse in the "non-error" case, i.e. no input is interpreted differently just because whitespace is allowed.
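
To illustrate option 2, a rough, untested sketch of the "check afterwards" step. It assumes ENTITY_ID is replaced by a parser rule along the lines of entityId : '__' ENTITY_VAR ('_w_' ID)? ; and that the generated classes are called MyGrammarParser/MyGrammarBaseListener (all names are placeholders). Because skipped whitespace still consumes input characters, a gap between the character indices of neighbouring tokens reveals that there was whitespace inside the id:

    import org.antlr.v4.runtime.Token;
    import org.antlr.v4.runtime.tree.ParseTree;
    import org.antlr.v4.runtime.tree.TerminalNode;

    public class EntityIdChecker extends MyGrammarBaseListener {

        @Override
        public void exitEntityId(MyGrammarParser.EntityIdContext ctx) {
            Token prev = null;
            for (int i = 0; i < ctx.getChildCount(); i++) {
                ParseTree child = ctx.getChild(i);
                if (!(child instanceof TerminalNode)) {
                    continue;
                }
                Token t = ((TerminalNode) child).getSymbol();
                // Skipped whitespace still consumes characters, so any gap
                // between the indices of neighbouring tokens means there was
                // whitespace inside the entity id.
                if (prev != null && t.getStartIndex() != prev.getStopIndex() + 1) {
                    throw new IllegalStateException(
                            "whitespace inside entity id at line " + t.getLine());
                }
                prev = t;
            }
        }
    }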

0
votes

As far as I can tell from browsing the documentation, something like that does not seem feasible.

Parser rules seem to operate only on the default channel, so I can't send WS to channel(HIDDEN) and then recover it for just a single parser rule.

On the other hand, an ANTLR author explains here that, as of version 4, it is no longer possible to break a token up into several tokens.

Even though I don't like it at all, the quickest way seems to be to match it in the lexer (as in the code from the question) and then re-parse the whole matched string again from Java.
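
For what it's worth, that re-parsing step is small. A minimal sketch, where the regex simply mirrors the ENTITY_ID rule from the question (adjust it if the rule changes):

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    final class EntityIdText {

        // Mirrors ENTITY_ID : '__' ENTITY_VAR ('_w_' ID)? from the grammar above.
        private static final Pattern ENTITY_ID =
                Pattern.compile("__(user|resource)(?:_w_([a-zA-Z][a-zA-Z0-9_#]*))?");

        /** Returns { entityVar, id }, where id is null if the '_w_' part is absent. */
        static String[] split(String tokenText) {
            Matcher m = ENTITY_ID.matcher(tokenText);
            if (!m.matches()) {
                throw new IllegalArgumentException("not an ENTITY_ID: " + tokenText);
            }
            return new String[] { m.group(1), m.group(2) };
        }
    }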

Still, any better option or correction to my conclusions is welcome.

0
votes

Hooking two parsers together in a sort of pipeline, as your own answer suggests, is a sound and simple design/solution, and I'm pretty sure ANTLR is capable of helping with that.

I don't know how far the ANTLR folks have gone in their work on stream/feed parsing. But adopting a two-pass strategy should be efficient enough, as the first pass would just be lexing a regular language, which is O(c * N) over the size of the input with a very small c.

If you want a single pass that costs O(k * N) (with a large k), you could consider PEG, for which there are implementations in Java (which I haven't tried).