
At some point in my grammar, I want ANTLR to read my input as two tokens instead of one. In my source file I have the value

12345.name

and the lexer consumes

12345.

as a single FLOAT token. At this specific point in the source file, I want ANTLR to read it as

  • 12345 (INT)
  • . (DOT)
  • name (NAME)

Is there a way to tell ANTLR to ignore the FLOAT rule at a given point?

This is my current .g4 file:

grammar Quest;
import Lua;

@header {
package dev.codeflush.m2qc.antlr;
}

/*
prefixed everything with "m2" to avoid name clashes
*/

m2QuestFile
    : m2Define* m2Quest* EOF
    ;

m2Define
    : 'define' NAME m2DefineValue
    ;

m2DefineValue
    : ~('\r\n' | '\r' | '\n')
    ;

m2Quest
    : 'quest' NAME 'begin' m2State* 'end'
    ;

m2State
    : 'state' NAME 'begin' (m2TriggerBlock | m2Function)* 'end'
    ;

m2TriggerBlock
    : 'when' m2Trigger ('or' m2Trigger)* ('with' exp)? 'begin' block 'end'
    ;

m2Function
    : 'function' NAME funcbody
    ;

m2Trigger
    : m2TriggerTarget DOT m2TriggerEvent DOT m2TriggerSubEvent DOT m2TriggerArgument
    | m2TriggerTarget DOT m2TriggerEvent DOT m2TriggerArgument
    | m2TriggerTarget DOT m2TriggerEvent
    | m2TriggerEvent
    ;

m2TriggerTarget
    : NAME
    | INT
    | NORMALSTRING
    ;

/*
not complete
*/
m2TriggerEvent
    : 'button'
    | 'enter'
    | 'info'
    | 'item_informer'
    | 'kill'
    | 'leave'
    | 'letter'
    | 'levelup'
    | 'login'
    | 'logout'
    | 'unmount'
    | 'target'
    | 'chat'
    | 'timer'
    | 'server_timer'
    ;

m2TriggerSubEvent
    : 'click'
    | 'chat'
    | 'arrive'
    ;

m2TriggerArgument
    : exp
    ;

DOT
    : '.'
    ;

I'm using the Lua grammar from https://github.com/antlr/grammars-v4/blob/master/lua/Lua.g4

My current sample input file looks like this:

quest test begin
    state start begin
        when kill begin
        end

        when "12345".kill begin
        end

        when 12345.kill begin
        end
    end
end

The first two work as intended, but the third one doesn't, because the lexer reads '12345.' as a single FLOAT token.

Would requiring at least one digit after the dot for float literals globally be an option? – sepp2k
@sepp2k Yes, that works for me. Thank you :) – Felix
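
For reference, sepp2k's suggestion amounts to overriding the imported FLOAT rule in Quest.g4 so that the dot must be followed by at least one digit. The following is only a minimal sketch that assumes the rough shape of that rule; the real FLOAT in Lua.g4 also has exponent alternatives, which would need the same change:

FLOAT
    : [0-9]+ '.' [0-9]+   // require a digit after the dot, so '12345.' no longer matches
    | '.' [0-9]+
    ;

Since rules defined in the importing grammar override imported rules with the same name, placing this in Quest.g4 is enough; 12345.kill is then lexed as INT, DOT, NAME.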

1 Answer


I had a similar need in my grammar, where I wanted to emit multiple tokens (two, actually) for a single match under a specific condition (here: when a dot is directly followed by an identifier, including a keyword).

// Special rule that should also match all keywords if they are directly preceded by a dot.
// Hence it's defined before all keywords.
// Here we make use of the ability in our base lexer to emit multiple tokens with a single rule.
DOT_IDENTIFIER:
    DOT_SYMBOL LETTER_WHEN_UNQUOTED_NO_DIGIT LETTER_WHEN_UNQUOTED* { emitDot(); } -> type(IDENTIFIER)
;

A helper function is needed to emit the extra token(s):

/**
 * Puts a DOT token onto the pending token list.
 */
void MySQLBaseLexer::emitDot() {
  _pendingTokens.emplace_back(_factory->create({this, _input}, MySQLLexer::DOT_SYMBOL, _text, channel,
                                               tokenStartCharIndex, tokenStartCharIndex, tokenStartLine,
                                               tokenStartCharPositionInLine));
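  // Advance the start index past the dot so the rule's own token (the identifier) begins after it.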
  ++tokenStartCharIndex;
}

This in turn requires custom handling of token production: you have to override the nextToken method in your lexer so that it serves tokens from the pending list before returning the next real token.

/**
 * Allow a grammar rule to emit as many tokens as it needs.
 */
std::unique_ptr<antlr4::Token> MySQLBaseLexer::nextToken() {
  // First respond with pending tokens to the next token request, if there are any.
  if (!_pendingTokens.empty()) {
    auto pending = std::move(_pendingTokens.front());
    _pendingTokens.pop_front();
    return pending;
  }

  // Let the main lexer class run the next token recognition.
  // This might create additional tokens again.
  auto next = Lexer::nextToken();
  if (!_pendingTokens.empty()) {
    auto pending = std::move(_pendingTokens.front());
    _pendingTokens.pop_front();
    _pendingTokens.push_back(std::move(next));
    return pending;
  }
  return next;
}

Keep in mind: the lexer rule still issues its own token (which I re-typed to IDENTIFIER here), so you only have to emit the additional tokens.
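
If changing FLOAT globally is not an option, the same multi-token idea could be adapted to the Quest grammar. The following is only a sketch for the Java target (to match the grammar's header); the INT_DOT rule name and the emitInt() helper are hypothetical. emitInt() would have to queue an INT token for the digits and advance tokenStartCharIndex past them, analogous to emitDot() above, and the lexer would need the same kind of nextToken override:

// Defined in Quest.g4, so it is tried before the imported FLOAT rule.
INT_DOT
    : [0-9]+ '.' { Character.isLetter(_input.LA(1)) }? // only when a name follows the dot
      { emitInt(); } // hypothetical helper: queue the INT part as a pending token
      -> type(DOT)
    ;

This way '12345.' is only split when a name follows; '12345.67' still matches the longer FLOAT alternative.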