Canonicalizing token text in ANTLR

Question

Is there a way in ANTLR to mark certain tokens as having canonical output?

For example, given the grammar (excerpt)

words : FOO BAR BAZ ;
FOO : [Ff] [Oo] [Oo] ;
BAR : [Bb] [Aa] [Rr] ;
BAZ : [Bb] [Aa] [Zz] ;
SP : [ ] -> channel(HIDDEN);

words will match "FOO BAR BAZ", "foo bar baz", "Foo bAr baZ", etc.

When I call TokenStream#getText(Context), it'll return the tokens' actual text concatenated together.

Is there a way to "canonicalize" this output such that no matter what the input, all FOO tokens render as "Foo", BAR tokens render as "Bar", and BAZ tokens render as "Baz" (for example)?

Given any of the inputs above, I'd like to have the output "Foo Bar Baz".

Sam Harwell · Accepted Answer · 2014-09-12T17:33:53

Any of the following options would work:

1. Implement your own method to obtain the text for a parse tree or range of tokens, and place the handling for certain known token types there.
2. Create your own Token class that knows to return the canonical form of certain tokens, and create a TokenFactory implementation that creates tokens of that type. Then use the setTokenFactory method to cause your lexer to produce those tokens.
3. Create your own TokenStream implementation that overrides the default behavior.
4. Explicitly specify the text in an action that runs prior to the creation of tokens:
```
FOO : [Ff] [Oo] [Oo] { _text = "Foo"; };
```

Other options are likely available as well.
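
As a rough illustration of the first option (your own text-rendering method), here is a minimal sketch in plain Java. It deliberately does not depend on the ANTLR runtime: `Tok` is a hypothetical stand-in for an ANTLR token that carries only a type and the matched text, and the token type constants and the `CanonicalText` class name are assumptions made up for this example. A generated lexer would supply its own constants.

```java
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for an ANTLR token: only type and matched text matter here. */
final class Tok {
    final int type;
    final String text;

    Tok(int type, String text) {
        this.type = type;
        this.text = text;
    }
}

final class CanonicalText {
    // Assumed token type numbers; a generated lexer defines the real ones.
    static final int FOO = 1, BAR = 2, BAZ = 3, SP = 4;

    // Canonical spelling per token type. Types not listed keep their matched text.
    static final Map<Integer, String> CANONICAL =
            Map.of(FOO, "Foo", BAR, "Bar", BAZ, "Baz");

    /** Concatenates token text, substituting the canonical form where one is defined. */
    static String render(List<Tok> tokens) {
        StringBuilder sb = new StringBuilder();
        for (Tok t : tokens) {
            sb.append(CANONICAL.getOrDefault(t.type, t.text));
        }
        return sb.toString();
    }
}
```

With a real parser, the same loop would presumably walk the tokens covered by a context (hidden-channel whitespace included) and read each token's `getType()` and `getText()` in place of the `Tok` fields. Keeping the mapping in one table means the grammar stays free of target-specific actions.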
