Use a regular expression to match ANY Chinese character in UTF-8 encoding

votes

For example, to match a string consisting of m to n Chinese characters, I could use:

[single Chinese character regular expression]{m,n}

Is there a regular expression for a single Chinese character, one that matches any Chinese character that exists?

regex unicode flex-lexer non-english

At the very least, please provide information on the regex engine you're using. – Lily Ballard

@KevinBallard I am not quite sure which engine I am using. All I know is that I use the regular expression functionality in flex (the lexer). – xiaohan2012

Possible duplicate of How to make a flex (lexical scanner) to read UTF-8 characters input? – Thomas Dickey

flex won't do this; answers which assume it does do not address the question. – Thomas Dickey

6 Answers

votes

The regex to match a Chinese (well, CJK) character is

\p{script=Han}

which can be abbreviated to simply

\p{Han}

This assumes that your regex compiler meets requirement RL1.2 Properties from UTS#18 Unicode Regular Expressions. Perl and Java 7 both meet that spec, but many others do not.
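As a minimal sketch of the {m,n} idea in an engine that supports script properties, here is a Java version. It uses \p{IsHan}, Java's spelling of the Han script property; the class and method names (HanRun, isHanRun) are made up for illustration, and the test strings are written with \u escapes so the source file encoding does not matter.

```java
import java.util.regex.Pattern;

class HanRun {
    // Hypothetical helper: true if s consists of between m and n Han-script code points.
    static boolean isHanRun(String s, int m, int n) {
        // \p{IsHan} matches one code point whose Unicode script is Han.
        Pattern p = Pattern.compile("\\p{IsHan}{" + m + "," + n + "}");
        return p.matcher(s).matches();
    }

    public static void main(String[] args) {
        String twoHan = "\u8096\u6657";                         // two Han characters
        System.out.println(isHanRun(twoHan, 1, 3));             // prints true
        System.out.println(isHanRun("abc", 1, 3));              // prints false
        System.out.println(isHanRun(twoHan + twoHan, 1, 3));    // prints false (four characters)
    }
}
```

This does not help with flex itself, which matches bytes rather than code points.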

votes

In Java,

\p{InCJK_UNIFIED_IDEOGRAPHS}{1,3}
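Note that a block and a script are not the same thing: the CJK Unified Ideographs block covers only U+4E00 to U+9FFF, while the Han script also includes the extension blocks. The following sketch compares the two in Java; the class and method names (HanVsBlock, inBlock, inScript) are hypothetical.

```java
import java.util.regex.Pattern;

class HanVsBlock {
    // Block property: only the basic CJK Unified Ideographs block.
    static final Pattern BLOCK = Pattern.compile("\\p{InCJK_UNIFIED_IDEOGRAPHS}");
    // Script property: every code point whose script is Han.
    static final Pattern SCRIPT = Pattern.compile("\\p{IsHan}");

    static boolean inBlock(int codePoint) {
        return BLOCK.matcher(new String(Character.toChars(codePoint))).matches();
    }

    static boolean inScript(int codePoint) {
        return SCRIPT.matcher(new String(Character.toChars(codePoint))).matches();
    }

    public static void main(String[] args) {
        // U+4E2D is in the basic block, U+3400 in Extension A, U+20000 in Extension B.
        int[] samples = {0x4E2D, 0x3400, 0x20000};
        for (int cp : samples) {
            System.out.printf("U+%04X block=%b script=%b%n", cp, inBlock(cp), inScript(cp));
        }
    }
}
```

If rarer characters matter, the script property is the safer choice; the block property is fine when only the basic range is expected.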

votes

Is there a regular expression for a single Chinese character, one that matches any Chinese character that exists?

Recommendation

To match patterns with Chinese characters and other Unicode code points with a Flex-compatible lexical analyzer, you could use the RE/flex lexical analyzer for C++ that is backwards compatible with Flex. RE/flex supports Unicode and works with Bison to build lexers and parsers.

You can write Unicode patterns (and UTF-8 regular expressions) in RE/flex specifications such as:

%option flex unicode
%%
[肖晗]   { printf ("xiaohan/2\n"); }
%%

Use global %option unicode to enable Unicode. You can also use a local modifier (?u:) to restrict Unicode to a single pattern (so everything else is still ASCII/8-bit as in Flex):

%option flex
%%
(?u:[肖晗])   { printf ("xiaohan/2\n"); }
(?u:\p{Han})  { printf ("Han character %s\n", yytext); }
.             { printf ("8-bit character %d\n", yytext[0]); }
%%

Option flex enables Flex compatibility, so you can use yytext, yyleng, ECHO, and so on. Without the flex option RE/flex expects Lexer method calls: text() (or str() and wstr() for std::string and std::wstring), size() (or wsize() for wide char length), and echo(). RE/flex method calls are cleaner IMHO, and include wide char operations.

Background

In plain old Flex I ended up defining ugly UTF-8 patterns to capture ASCII letters and UTF-8 encoded letters for a compiler project that required support for Unicode identifiers id:

digit           [0-9]
alpha           ([a-zA-Z_\xA8\xAA\xAD\xAF\xB2\xB5\xB7\xB8\xB9\xBA\xBC\xBD\xBE]|[\xC0-\xFF][\x80-\xBF]*|\\u([0-9a-fA-F]{4}))
id              ({alpha})({alpha}|{digit})*

The alpha pattern supports ASCII letters, underscore, and Unicode code points that are used in identifiers (\p{L} etc). To keep its size manageable, the pattern permits more Unicode code points than strictly necessary, so it trades accuracy for compactness and in some cases accepts overlong encodings that are not valid UTF-8. If you are considering this approach, then be wary of these problems and safety concerns. Use a Unicode-capable scanner generator instead, such as RE/flex.
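To illustrate what a tighter hand-written byte pattern involves, here is a sketch for just the basic block U+4E00 to U+9FFF, whose UTF-8 encodings run from E4 B8 80 to E9 BF BF. The alternation below is my own derivation, not part of the original answer, and the Java harness (class Utf8BytePattern and its methods are hypothetical names) only checks it by treating each UTF-8 byte as one Latin-1 character. In a Flex rule the same alternation would be written with the same \x escapes.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Pattern;

class Utf8BytePattern {
    // Byte-level alternation for U+4E00..U+9FFF; each regex char stands for one byte.
    static final Pattern HAN_BYTES = Pattern.compile(
        "\\xE4[\\xB8-\\xBF][\\x80-\\xBF]|[\\xE5-\\xE9][\\x80-\\xBF][\\x80-\\xBF]");

    // Encode the code point as UTF-8, then map every byte to one Latin-1 char.
    static boolean matchesUtf8Bytes(int codePoint) {
        byte[] utf8 = new String(Character.toChars(codePoint)).getBytes(StandardCharsets.UTF_8);
        String oneCharPerByte = new String(utf8, StandardCharsets.ISO_8859_1);
        return HAN_BYTES.matcher(oneCharPerByte).matches();
    }

    // Count code points where the byte pattern disagrees with the range U+4E00..U+9FFF.
    static int countMismatches() {
        int mismatches = 0;
        for (int cp = 0; cp <= 0x10FFFF; cp++) {
            if (cp >= 0xD800 && cp <= 0xDFFF) continue; // surrogates have no UTF-8 encoding
            boolean expected = cp >= 0x4E00 && cp <= 0x9FFF;
            if (matchesUtf8Bytes(cp) != expected) mismatches++;
        }
        return mismatches;
    }

    public static void main(String[] args) {
        System.out.println("mismatches: " + countMismatches());
    }
}
```

Covering the extension blocks as well would need further alternatives, including four-byte sequences, which is where hand-written patterns become error-prone.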

Safety

When using UTF-8 directly in Flex patterns, there are several concerns:

Encoding your own UTF-8 patterns in Flex for matching any Unicode character is error-prone. Patterns should be restricted to characters in the valid Unicode range only. Valid Unicode code points cover the ranges U+0000 to U+D7FF and U+E000 to U+10FFFF. The range U+D800 to U+DFFF is reserved for UTF-16 surrogate pairs, and its code points are invalid. When using a tool to convert a Unicode range to UTF-8, make sure to exclude invalid code points.
Patterns should reject overlong and other invalid byte sequences. Invalid UTF-8 should not be silently accepted.
Catching lexical input errors in your lexer requires a special . (dot) that matches valid and invalid Unicode, including UTF-8 overruns and invalid byte sequences, in order to produce an error message that the input is rejected. If you use dot as a "catch-all-else" to produce an error message, but your dot does not match invalid Unicode, then your lexer will hang ("scanner is jammed") or your lexer will ECHO rubbish characters on the output by the Flex "default rule".
Your scanner should recognize a UTF BOM (Unicode Byte Order Mark) in the input to switch to UTF-8, UTF-16 (LE or BE), or UTF-32 (LE or BE).
As you point out, patterns such as [unicode characters] do not work at all with Flex because UTF-8 characters in a bracket list are multibyte characters and each single byte character can be matched but not the UTF-8 character.
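To make the kinds of invalid input listed above concrete, the sketch below runs a few byte sequences through Java's strict UTF-8 decoder. This is not a Flex solution, only a way to see which test inputs a lexer ought to reject; the class and method names (StrictUtf8, isValidUtf8) are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class StrictUtf8 {
    // True only if the whole byte array decodes as UTF-8 without any malformed input.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] valid     = {(byte) 0xE8, (byte) 0x82, (byte) 0x96}; // U+8096
        byte[] overlong  = {(byte) 0xC0, (byte) 0x80};              // overlong form of U+0000
        byte[] surrogate = {(byte) 0xED, (byte) 0xA0, (byte) 0x80}; // encodes surrogate U+D800
        byte[] truncated = {(byte) 0xE8, (byte) 0x82};              // missing final byte
        System.out.println(isValidUtf8(valid));     // prints true
        System.out.println(isValidUtf8(overlong));  // prints false
        System.out.println(isValidUtf8(surrogate)); // prints false
        System.out.println(isValidUtf8(truncated)); // prints false
    }
}
```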

See also invalid UTF encodings in the RE/flex user guide.

votes

In C#

new Regex(@"\p{IsCJKUnifiedIdeographs}")

Here it is in the Microsoft docs

And here's more info from Wikipedia: CJK Unified Ideographs

The basic block named CJK Unified Ideographs (4E00–9FFF) contains 20,976 basic Chinese characters in the range U+4E00 through U+9FEF. The block not only includes characters used in the Chinese writing system but also kanji used in the Japanese writing system and hanja, whose use is diminishing in Korea. Many characters in this block are used in all three writing systems, while others are in only one or two of the three. Chinese characters are also used in Vietnam's Nôm script (now obsolete).

votes

I just solved a similar problem.

When there is too much to match, it is better to use a negated set and declare what you don't want to match, for example:

Anything except digits: ^[^0-9]*$

The second ^ (the one inside the brackets) performs the negation.

-1

votes

In Java 7 and up, the script property is written with an Is prefix: \p{IsHan} (as a Java string literal, "\\p{IsHan}").