4 votes

How can I use Lex/Yacc to recognize identifiers in Chinese characters?

2 Answers

2 votes

I think you mean Lex (the lexer generator). Yacc is the parser generator.

According to What's the complete range for Chinese characters in Unicode?, most CJK characters fall in the U+3400-U+9FFF range.

According to http://dinosaur.compilertools.net/lex/index.html

Arbitrary character. To match almost any character, the operator character . is the class of all characters except newline. Escaping into octal is possible although non-portable:

                             [\40-\176]

matches all printable characters in the ASCII character set, from octal 40 (blank) to octal 176 (tilde).

So I would assume what you need is something like [\32000-\117777] (0x3400 is 32000 in octal, and 0x9FFF is 117777).
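
For concreteness, here is a minimal sketch (assuming flex; the rule actions and printf are mine, not from the lex manual) showing how such an octal character class is used in an actual rule. Note that classic lex and flex scan the input byte by byte and only accept escapes up to octal \377, so the wide range suggested above would need a Unicode-aware lex variant; the printable-ASCII class from the quoted documentation is used here instead.

    %{
    #include <stdio.h>
    %}
    %option noyywrap
    %%
    [\40-\176]+    { printf("printable run: %s\n", yytext); }
    \n             { /* skip newlines */ }
    %%
    int main(void) { return yylex(); }

Generate and build with flex scanner.l && cc lex.yy.c, then feed text on standard input.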

1 vote

Yacc does not care about Chinese characters, but lex does: it is responsible for analyzing the input bytes (and characters) to recognize tokens. However, Chinese characters are generally multibyte (three bytes each in UTF-8, for example). There are lex-like programs which may support this, but they are not lex. It has been discussed several times.

Further reading:

The standard lexical tokenizer, lex (or flex), does not accept multi-byte characters, and is thus impractical for many modern languages. This document describes a mapping from regular expressions describing UTF-8 multi-byte characters to regular expressions of single bytes.
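
As an illustration of that single-byte mapping, here is a small sketch (my own, assuming flex and UTF-8 input; the CJK definition name and the byte ranges are hand-derived for the U+3400-U+9FFF range and are not taken from the document):

    %{
    #include <stdio.h>
    %}
    %option noyywrap 8bit

    /* U+3400-U+3FFF encodes as E3 90-BF 80-BF; U+4000-U+9FFF as E4-E9 80-BF 80-BF */
    CJK    (\xE3[\x90-\xBF][\x80-\xBF]|[\xE4-\xE9][\x80-\xBF][\x80-\xBF])

    %%
    {CJK}+         { printf("CJK identifier: %s\n", yytext); }
    [ \t\r\n]+     { /* skip whitespace */ }
    .              { /* ignore other bytes in this demo */ }
    %%

    int main(void) { return yylex(); }

The scanner itself never needs to know about Unicode: each CJK code point is recognized as a fixed three-byte UTF-8 sequence, which is exactly the regular-expression rewriting the document above describes.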