How can I use Lex/Yacc to recognize identifiers in Chinese characters?
2 Answers
I think you mean Lex (the lexer generator). Yacc is the parser generator.
According to What's the complete range for Chinese characters in Unicode?, most CJK characters fall in the U+3400-U+9FFF range.
According to http://dinosaur.compilertools.net/lex/index.html
Arbitrary character. To match almost any character, the operator character . is the class of all characters except newline. Escaping into octal is possible although non-portable:
[\40-\176]
matches all printable characters in the ASCII character set, from octal 40 (blank) to octal 176 (tilde).
So I would assume what you need is something like [\32000-\117777].
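For concreteness, here is a minimal flex specification built around the quoted octal class (the printf action is just illustrative). One caveat: in standard lex/flex an octal escape names a single byte, so it tops out at \377; a range like [\32000-\117777] would only work in a lex variant that matches whole code points rather than bytes.

%{
#include <stdio.h>
%}
%%
[\40-\176]+    { printf("printable run: %s\n", yytext); }
\n             { /* skip newlines */ }
%%
int main(void) { yylex(); return 0; }
int yywrap(void) { return 1; }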
Yacc does not care about Chinese characters, but lex does: it is responsible for analyzing the input bytes (and characters) to recognize tokens. However, Chinese characters are generally multibyte. There are lex-like programs which may support this, but they are not lex. It has been discussed several times.
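A common workaround, described in the further-reading links below, is to spell out each code point's UTF-8 byte sequence in the patterns. Here is a minimal flex sketch along those lines, assuming UTF-8 input and the U+3400-U+9FFF range from the first answer (the byte ranges are my own derivation, and the identifier rule is deliberately simplistic):

%{
#include <stdio.h>
%}
/* UTF-8 encodings for the range U+3400-U+9FFF:
   U+3400-U+3FFF  =>  E3 90-BF 80-BF
   U+4000-U+9FFF  =>  E4-E9 80-BF 80-BF  */
CJK   (\xE3[\x90-\xBF][\x80-\xBF]|[\xE4-\xE9][\x80-\xBF][\x80-\xBF])
%%
{CJK}+         { printf("IDENT: %s\n", yytext); }
.|\n           { /* anything else: ignored in this sketch */ }
%%
int main(void) { yylex(); return 0; }
int yywrap(void) { return 1; }

Built with something like flex cjk.l && cc lex.yy.c, this prints IDENT: for each run of CJK characters in a UTF-8 stream; a real lexer would return a token to yacc instead of printing.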
Further reading:
The standard lexical tokenizer, lex (or flex), does not accept multi-byte characters, and is thus impractical for many modern languages. This document describes a mapping from regular expressions describing UTF-8 multi-byte characters to regular expressions of single bytes.
Flex (lexer) support for Unicode (2012/3/8)
Answers point out how you can work around the limitation by using special cases of UTF-8 patterns.
Unicode Support in Flex (2009/4/26)
Essentially the same question as the previous one (but it predates it, and is a possible source for those comments).
How do I lex unicode characters in C?
An answer lists some alternative implementations which may do what was asked here.