lexer skips a token

Question

I am trying to do basic ANTLR-based scanning, but my lexer is not matching the tokens I expect.

lexer grammar DefaultLexer;

ALPHANUM    :   (LETTER | DIGIT)+;
ACRONYM     :   LETTER '.' (LETTER '.')+;
HOST        :   ALPHANUM (('.' | '-') ALPHANUM)+;

fragment
LETTER  :   UNICODE_CLASS_LL | UNICODE_CLASS_LM | UNICODE_CLASS_LO | UNICODE_CLASS_LT | UNICODE_CLASS_LU;

fragment
DIGIT   :   UNICODE_CLASS_ND | UNICODE_CLASS_NL;

With the grammar above, the input string hello. world produces only world, whereas I would expect to get both hello and world. What am I missing? Thanks.

ADDED:

OK, I learned that the input hello. world matches more characters using the HOST rule than using ALPHANUM, so the lexer chooses HOST. Then, when it fails to match the input against the HOST rule, it does not "look back" to ALPHANUM, because that's how the lexer works.

How do I get around it?

Sam Harwell · Accepted Answer · 2013-07-02T11:17:04

As a foreword, ANTLR 4 would not behave in a strange manner here. Both ANTLR 3 and ANTLR 4 should match ALPHANUM, then report 2 syntax errors, then match another ALPHANUM, and I can state with confidence that ANTLR 4 will behave that way.

It looks like your HOST rule might be better suited to be host, a parser rule.
You need to make sure to provide a lexer rule that can match ". ", the dot followed by a space (either together or as two separate tokens).
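
A minimal sketch of those two suggestions, assuming ANTLR 4 syntax. The grammar name HostSketch, the input rule, and the DOT, DASH and WS tokens are hypothetical names introduced for illustration, and the ASCII character sets stand in for the UNICODE_CLASS_* fragments, which are not shown in the question. ACRONYM is left out to keep the sketch short.

```antlr
grammar HostSketch;

// Parser rules (lowercase). host is assembled from tokens, so the lexer
// never has to commit to a partially matched HOST.
input   : (host | ALPHANUM | DOT | DASH | WS)* EOF ;
host    : ALPHANUM ((DOT | DASH) ALPHANUM)+ ;

// Lexer rules. The dot and the whitespace each get their own token,
// so neither one ends up as a syntax error.
ALPHANUM : (LETTER | DIGIT)+ ;
DOT      : '.' ;
DASH     : '-' ;
WS       : [ \t\r\n]+ ;   // kept as a real token so "hello. world" is not read as a host

// ASCII stand-ins (assumption) for the Unicode class fragments in the question.
fragment LETTER : [a-zA-Z] ;
fragment DIGIT  : [0-9] ;
```

With a grammar along these lines, hello. world should lex as ALPHANUM, DOT, WS, ALPHANUM, while hello.world can be grouped by the host parser rule. WS is deliberately not skipped here: if it were, the parser could no longer tell the two inputs apart.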
