5
votes

Is there a way in MySQL to select words that are only Chinese, only Japanese, or only Korean?

In English it can be done with:

SELECT * FROM table WHERE field REGEXP '[a-zA-Z0-9]'

or even a "dirty" solution like:

SELECT * FROM table WHERE field > "0" AND field < "ZZZZZZZZ"

Is there a similar solution for eastern languages / CJK characters?

I understand that Chinese and Japanese share characters so there is a chance that Japanese words using these characters will be mistaken for Chinese words. I guess those words would not be filtered.

The words are stored in a utf-8 string field.

If this cannot be done in mysql, can it be done in PHP?

Thanks! :)

Edit 1: The data does not record which language each string is in, so I cannot filter by another field.

Edit 2: Using a translator API like Bing's (Google is closing their Translator API) is an interesting idea, but I was hoping for a faster regex-style solution.

4
1) Transform your string into raw codepoints (e.g. UCS-4). 2) Check each character against your desired range. For CJK glyphs you may be lucky: they actually form one contiguous range (or at least only a handful). – Kerrek SB
This is similar, but not identical to, stackoverflow.com/questions/1441562/… – Arafangion

4 Answers

3
votes

Searching for a UTF-8 range of characters is not directly supported in MySQL's REGEXP. See the MySQL reference for REGEXP, where it states:

Warning The REGEXP and RLIKE operators work in byte-wise fashion, so they are not multi-byte safe and may produce unexpected results with multi-byte character sets.

Fortunately, in PHP you can build such a regexp, e.g. with

/[\x{1234}-\x{5678}]*/u

(note the u modifier at the end of the regexp). You therefore need to find the appropriate ranges for your different languages. Using the Unicode code charts will let you pick the appropriate script for each language (although not the language itself directly).
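For concreteness, here is a sketch of the same Unicode-range approach in Python (the block boundaries come from the Unicode code charts; a production filter would also need the CJK extension blocks, half-width kana, Hangul Jamo, and so on):

```python
import re

# Script blocks from the Unicode code charts (illustrative subset).
SCRIPTS = {
    'han':      re.compile(r'^[\u4e00-\u9fff]+$'),  # CJK Unified Ideographs
    'hiragana': re.compile(r'^[\u3040-\u309f]+$'),
    'katakana': re.compile(r'^[\u30a0-\u30ff]+$'),
    'hangul':   re.compile(r'^[\uac00-\ud7af]+$'),  # Hangul Syllables
}

def script_of(word):
    """Return the script name if the whole word falls within one block."""
    for name, pattern in SCRIPTS.items():
        if pattern.match(word):
            return name
    return 'other'
```

The PHP equivalent uses the same codepoints in `\x{...}` form, e.g. `/^[\x{AC00}-\x{D7AF}]+$/u` for Hangul syllables.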

1
votes

You can't do this from the character set alone, especially in modern times where Asian texts are frequently "romanized", that is, written with the Roman script. That said, if you merely want to select texts that are superficially 'Asian', there are ways of doing that, depending on just how complicated you want to be and how accurate you need to be.

But honestly, I suggest that you add a new "language" field to your database and ensure that it's populated correctly.

That said, here are some useful links you may be interested in:

The latter is relatively complex to implement, but yields a much better result.

Alternatively, I believe that Google has an (online) API that will let you both detect and translate a language.

An interesting paper that demonstrates the futility of this exercise is:

Finally, you ask:

If this cannot be done in MySQL, can it be done in PHP?

It will likely be much easier to do this in PHP, because there you can perform mathematical analysis on the string in question, although you'll probably want to feed the results back into the database as a kludgy way of caching them for performance reasons.

0
votes

You may consider another data structure that contains the words and/or characters and the language you want to associate them with.

The 'normal' ASCII characters will be associated with many more languages than just English, for instance, just as other characters may be associated with more than just Chinese.

0
votes

Korean mostly uses its own alphabet called Hangul. Occasionally there will be some Han characters thrown in.

Japanese uses three writing systems combined. Of these, Katakana and Hiragana are unique to Japanese and thus are hardly ever used in Korean or Chinese text.

Japanese and Chinese both use Han characters, though, which occupy the same Unicode range(s), so there is no simple way to differentiate them based on character ranges alone!

There are some heuristics though.

Mainland China uses simplified characters, many of which are unique and thus are hardly ever used in Japanese or Korean text.

Japan also simplified a small number of common characters, many of which are unique and thus will hardly ever be used in Chinese or Korean text.

But there are certainly plenty of occasions where the same strings of characters are valid as both Japanese and Chinese, especially in the case of very short strings.
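The script facts above suggest a simple first-pass heuristic. This is a sketch, not foolproof, precisely because Han-only strings stay ambiguous:

```python
def guess_cjk_language(text):
    """First-pass guess by script membership:
    Hangul => Korean, kana => Japanese, Han alone => ambiguous."""
    has_han = has_kana = has_hangul = False
    for ch in text:
        cp = ord(ch)
        if 0x3040 <= cp <= 0x30ff:        # Hiragana and Katakana
            has_kana = True
        elif 0xac00 <= cp <= 0xd7af:      # Hangul Syllables
            has_hangul = True
        elif 0x4e00 <= cp <= 0x9fff:      # CJK Unified Ideographs
            has_han = True
    if has_hangul:
        return 'korean'
    if has_kana:
        return 'japanese'
    if has_han:
        return 'chinese-or-japanese'      # cannot tell from ranges alone
    return 'unknown'
```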

One method that will work with all text is to look at groups of characters. This means n-grams and probably Markov models as Arafangion mentions in their answer. But be aware that even this is not foolproof in the case of very short strings!
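The n-gram idea can be sketched as comparing a string's character-bigram profile against reference profiles built from known samples of each language (building real reference profiles needs large corpora; only the profile extraction and scoring are shown here):

```python
from collections import Counter

def char_ngrams(text, n=2):
    """Character n-grams; for CJK text even bigrams carry a lot of signal."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def overlap(profile, reference):
    """Fraction of the profile's n-grams also seen in the reference (0..1).
    Classify by picking the reference profile with the highest score."""
    shared = sum(min(count, reference[gram])
                 for gram, count in profile.items() if gram in reference)
    total = sum(profile.values())
    return shared / total if total else 0.0
```

As noted above, even this breaks down for very short strings, where a single shared bigram can dominate the score.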

And of course none of this is going to be implemented in any database software so you will have to do it in your programming language.