5
votes

Is there a way in MySQL to select words that are only Chinese, only Japanese, or only Korean?

In English it can be done with:

SELECT * FROM table WHERE field REGEXP '[a-zA-Z0-9]'

or even a "dirty" solution like:

SELECT * FROM table WHERE field > "0" AND field < "ZZZZZZZZ"

Is there a similar solution for eastern languages / CJK characters?

I understand that Chinese and Japanese share characters so there is a chance that Japanese words using these characters will be mistaken for Chinese words. I guess those words would not be filtered.

The words are stored in a utf-8 string field.

If this cannot be done in mysql, can it be done in PHP?

Thanks! :)

Edit 1: The data does not record which language each string is in, so I cannot filter by another field.

Edit 2: Using a translator API like Bing's (Google is closing their Translator API) is an interesting idea, but I was hoping for a faster regex-style solution.

4
1) Transform your string into raw codepoints (e.g. UCS-4). 2) Check each character against your desired range. For CJK glyphs you may be lucky: they actually form one contiguous range (or at least only a handful). – Kerrek SB
This is similar, but not identical to, stackoverflow.com/questions/1441562/… – Arafangion

4 Answers

3
votes

Searching for a UTF-8 range of characters is not directly supported in MySQL's REGEXP. See the MySQL reference for REGEXP, where it states:

Warning The REGEXP and RLIKE operators work in byte-wise fashion, so they are not multi-byte safe and may produce unexpected results with multi-byte character sets.

Fortunately, in PHP you can build such a regexp, e.g. with

/[\x{1234}-\x{5678}]*/u

(note the u modifier at the end of the regexp). You therefore need to find the appropriate ranges for your different languages. Using the Unicode code charts will let you pick the appropriate script for each language (although not the language itself directly).
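For concreteness, here is a sketch of the same Unicode-range approach in Python (the block boundaries come from the Unicode code charts; a production filter would also need the CJK extension blocks, half-width kana, Hangul Jamo, and so on):

```python
import re

# Script blocks from the Unicode code charts (illustrative subset).
SCRIPTS = {
    'han':      re.compile(r'^[\u4e00-\u9fff]+$'),  # CJK Unified Ideographs
    'hiragana': re.compile(r'^[\u3040-\u309f]+$'),
    'katakana': re.compile(r'^[\u30a0-\u30ff]+$'),
    'hangul':   re.compile(r'^[\uac00-\ud7af]+$'),  # Hangul Syllables
}

def script_of(word):
    """Return the script name if the whole word falls within one block."""
    for name, pattern in SCRIPTS.items():
        if pattern.match(word):
            return name
    return 'other'
```

The PHP equivalent uses the same codepoints in `\x{...}` form, e.g. `/^[\x{AC00}-\x{D7AF}]+$/u` for Hangul syllables.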

1
votes

You can't do this from the character set alone, especially in modern times where Asian texts are frequently "romanized", that is, written with the Roman script. That said, if you merely want to select texts that are superficially 'Asian', there are ways of doing that, depending on just how complicated you want to be and how accurate you need to be.

But honestly, I suggest that you add a new "language" field to your database and ensure that it's populated correctly.

That said, here are some useful links you may be interested in:

The latter is relatively complex to implement, but yields a much better result.

Alternatively, I believe that Google has an (online) API that will let you both detect and translate a language.

An interesting paper that demonstrates the futility of this exercise is:

Finally, you ask:

If this cannot be done in MySQL, can it be done in PHP?

It will likely be much easier to do this in PHP, because there you can perform mathematical analysis on the string in question, although you'll probably want to feed the results back into the database as a kludgy way of caching them for performance reasons.

0
votes

You may consider another data structure that contains the words and/or characters and the language you want to associate them with.

The 'normal' ASCII characters will be associated with many more languages than just English, for instance, just as other characters may be associated with more than just Chinese.

0
votes

Korean mostly uses its own alphabet called Hangul. Occasionally there will be some Han characters thrown in.

Japanese uses three writing systems combined. Of these, Katakana and Hiragana are unique to Japanese and thus are hardly ever used in Korean or Chinese text.

Japanese and Chinese both use Han characters, though, which occupy the same Unicode range(s), so there is no simple way to differentiate them based on character ranges alone!

There are some heuristics though.

Mainland China uses simplified characters, many of which are unique and thus are hardly ever used in Japanese or Korean text.

Japan also simplified a small number of common characters, many of which are unique and thus will hardly ever be used in Chinese or Korean text.

But there are certainly plenty of occasions where the same strings of characters are valid as both Japanese and Chinese, especially in the case of very short strings.
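The script facts above suggest a simple first-pass heuristic. This is a sketch, not foolproof, precisely because Han-only strings stay ambiguous:

```python
def guess_cjk_language(text):
    """First-pass guess by script membership:
    Hangul => Korean, kana => Japanese, Han alone => ambiguous."""
    has_han = has_kana = has_hangul = False
    for ch in text:
        cp = ord(ch)
        if 0x3040 <= cp <= 0x30ff:        # Hiragana and Katakana
            has_kana = True
        elif 0xac00 <= cp <= 0xd7af:      # Hangul Syllables
            has_hangul = True
        elif 0x4e00 <= cp <= 0x9fff:      # CJK Unified Ideographs
            has_han = True
    if has_hangul:
        return 'korean'
    if has_kana:
        return 'japanese'
    if has_han:
        return 'chinese-or-japanese'      # cannot tell from ranges alone
    return 'unknown'
```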

One method that will work with all text is to look at groups of characters. This means n-grams and probably Markov models as Arafangion mentions in their answer. But be aware that even this is not foolproof in the case of very short strings!
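The n-gram idea can be sketched as comparing a string's character-bigram profile against reference profiles built from known samples of each language (building real reference profiles needs large corpora; only the profile extraction and scoring are shown here):

```python
from collections import Counter

def char_ngrams(text, n=2):
    """Character n-grams; for CJK text even bigrams carry a lot of signal."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def overlap(profile, reference):
    """Fraction of the profile's n-grams also seen in the reference (0..1).
    Classify by picking the reference profile with the highest score."""
    shared = sum(min(count, reference[gram])
                 for gram, count in profile.items() if gram in reference)
    total = sum(profile.values())
    return shared / total if total else 0.0
```

As noted above, even this breaks down for very short strings, where a single shared bigram can dominate the score.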

And of course none of this is going to be implemented in any database software so you will have to do it in your programming language.