3
votes

I have been using ANTLR 4 to parse a German document. So far, I use the following rule to match text that includes German characters:

LETTERS:
[a-zA-Z_\u00DC\u00FC\u00D6\u00F6\u00C4\u00E4\u00DF]; // hex unicodes for ÜüÖöÄäß

What is the best way to describe the letters of all languages in Unicode in a way that ANTLR understands, without specifying each language or character individually? For example, French, Arabic, Chinese, or Japanese characters?

Thank you

1 Answer

2
votes

The best way is to use character ranges corresponding to the desired Unicode classes. Even then, the result can be a bit clumsy. See this worked example.

The raw data available in the Unicode standard's Appendix tables can be stripped and munged into a usable format with just a bit too much effort. ;)
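As a rough illustration of that munging step, here is a minimal Python sketch that derives the ranges from the Unicode data bundled with Python's `unicodedata` module instead of the raw tables. The helper names (`letter_ranges`, `to_antlr_set`) and the rule name `UNICODE_LETTER` are made up for this example, and it assumes that "letter" means any code point whose general category starts with `L`, limited to the Basic Multilingual Plane so that four-digit `\uXXXX` escapes are enough.

```python
import unicodedata


def letter_ranges(max_cp=0xFFFF):
    """Collect inclusive (start, end) code point ranges of letters.

    Assumption: a "letter" is any code point whose Unicode general
    category starts with "L" (Lu, Ll, Lt, Lm, Lo).
    """
    ranges = []
    start = None
    for cp in range(max_cp + 1):
        is_letter = unicodedata.category(chr(cp)).startswith("L")
        if is_letter and start is None:
            start = cp  # a new run of letters begins here
        elif not is_letter and start is not None:
            ranges.append((start, cp - 1))  # the run ended at the previous code point
            start = None
    if start is not None:
        ranges.append((start, max_cp))
    return ranges


def to_antlr_set(ranges):
    """Format ranges as a lexer character set using \\uXXXX escapes."""
    parts = []
    for lo, hi in ranges:
        if lo == hi:
            parts.append("\\u%04X" % lo)
        else:
            parts.append("\\u%04X-\\u%04X" % (lo, hi))
    return "[" + "".join(parts) + "]"


if __name__ == "__main__":
    # Hypothetical rule name; paste the output into a lexer grammar.
    print("fragment UNICODE_LETTER : " + to_antlr_set(letter_ranges()) + " ;")
```

The generated set is long, and the exact ranges depend on the Unicode version shipped with the Python interpreter. ANTLR 4.7 and later also accept Unicode property escapes such as `[\p{L}]` in lexer character sets, which may make a generated list unnecessary if you can use a recent release.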