I want to be able to recognize Chinese, Japanese, and Korean written characters, both as a general group and as subdivided languages. These are the reasons:
- Recognize CJK as a general group: I am making a vertical script Mongolian TextView. To do that I need to rotate the line of text 90 degrees because the glyphs are stored horizontally in the font. However, for CJK languages, I need to rotate the characters back again so that they are written in their correct orientation but just stacked on top of each other down the line.
- Differentiate CJK into specific languages: I'm also making a Mongolian dictionary, and when users enter a CJK character to look up, I would like to automatically recognize the language. Because Chinese characters are also used in Japanese and Korean, I'm guessing that I won't be able to fully accomplish this, but I want to do it to the maximum extent that the encoding allows.
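To make the first requirement concrete, here is a minimal sketch of the kind of "is this CJK at all?" test I have in mind. It relies on `Character.UnicodeBlock`, which I believe has been available since early Android API levels, though that is worth verifying. The class name `CjkGroup`, its method names, and the choice of which blocks count as CJK are my own assumptions for illustration. CJK punctuation and Bopomofo are deliberately left out.

```java
// Hypothetical helper: names and the selection of blocks are assumptions.
final class CjkGroup {
    private CjkGroup() {}

    /** Returns true if the code point lies in one of the blocks listed below. */
    static boolean isCjk(int codePoint) {
        Character.UnicodeBlock block = Character.UnicodeBlock.of(codePoint);
        if (block == null) {
            return false; // code point is not a member of any defined block
        }
        return block == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS
            || block == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS_EXTENSION_A
            || block == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS_EXTENSION_B
            || block == Character.UnicodeBlock.CJK_COMPATIBILITY_IDEOGRAPHS
            || block == Character.UnicodeBlock.HIRAGANA
            || block == Character.UnicodeBlock.KATAKANA
            || block == Character.UnicodeBlock.KATAKANA_PHONETIC_EXTENSIONS
            || block == Character.UnicodeBlock.HANGUL_JAMO
            || block == Character.UnicodeBlock.HANGUL_COMPATIBILITY_JAMO
            || block == Character.UnicodeBlock.HANGUL_SYLLABLES;
    }

    /** Returns true if any code point in the text passes isCjk. */
    static boolean containsCjk(String text) {
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i); // handles surrogate pairs
            if (isCjk(cp)) {
                return true;
            }
            i += Character.charCount(cp);
        }
        return false;
    }
}
```

With something like this, the TextView could call `CjkGroup.isCjk(cp)` for each code point while laying out a line and rotate only the matching glyphs back to upright. Iterating by code point rather than by `char` matters because Extension B ideographs lie outside the Basic Multilingual Plane.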
On the linguistic side, the subcategories that I am aware of are:
- Chinese traditional characters
- Chinese simplified characters
- Japanese Kanji (Chinese characters)
- Japanese Hiragana (native alphabet)
- Japanese Katakana (alphabet for writing foreign words)
- Korean Hangul (phonetic)
- Korean Hanja (Chinese Characters)
For the sake of completeness, Chinese characters are also used in Vietnamese (so CJK is also called CJKV). For my current purposes I don't need to worry about it, but it could be a future consideration. I am also ignoring romanized scripts like Chinese pinyin or Japanese romaji. They will be handled the same as English and Mongolian in the TextView (i.e., rotated 90 degrees with the rest of the line). Bopomofo, used in Taiwan, could also be a future consideration, but I will ignore it for now. See also here and here for language examples.
I've seen a number of related questions that usually deal with one specific language in Java or Android but no overarching question with a canonical answer. Other questions are more general for Unicode but don't tell how to do it in Java and Android. Here are some of the specific ones.
- How to check whether given text is english or chinese in android?
- How can I detect japanese text in a Java string?
- Check if string contains CJK (chinese) characters
- Use regular expression to match ANY Chinese character in utf-8 encoding
- Testing for Japanese/Chinese Characters in a string
- Different representation of unicode code points in Japanese and chinese
- Check if a character is Traditional Chinese in Big-5 (Java)?
- Unicode characters necessary for Japanese, Korean, and Chinese
- Does same chinese characters shared by cjk share same unicode value?
- What's the complete range for Chinese characters in Unicode?
So my question is: how far can I differentiate the CJK languages using Unicode code points, and how can I test for them in Android? I've seen some newer tests in Java and Android, and while these are useful to know, I also need to support older Android devices.
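To illustrate the level of differentiation I am hoping for, here is a rough sketch, again using only `Character.UnicodeBlock`. The class `CjkGuess`, both enums, and the heuristic itself are assumptions of mine, not an established method. The heuristic is: any kana suggests Japanese, any Hangul suggests Korean, and text made up only of Han ideographs is reported as ambiguous. As far as I understand, shared ideographs have the same code point in all three languages, and simplified and traditional characters sit in the same blocks, so block membership alone cannot separate them.

```java
// Hypothetical sketch: class, enums, and heuristic are illustrative assumptions.
final class CjkGuess {
    enum Script { HAN, HIRAGANA, KATAKANA, HANGUL, OTHER }
    enum Language { JAPANESE, KOREAN, HAN_AMBIGUOUS, NOT_CJK }

    private CjkGuess() {}

    /** Maps a code point to a coarse script category via its Unicode block. */
    static Script scriptOf(int codePoint) {
        Character.UnicodeBlock b = Character.UnicodeBlock.of(codePoint);
        if (b == null) return Script.OTHER;
        if (b == Character.UnicodeBlock.HIRAGANA) return Script.HIRAGANA;
        if (b == Character.UnicodeBlock.KATAKANA
                || b == Character.UnicodeBlock.KATAKANA_PHONETIC_EXTENSIONS) {
            return Script.KATAKANA;
        }
        if (b == Character.UnicodeBlock.HANGUL_SYLLABLES
                || b == Character.UnicodeBlock.HANGUL_JAMO
                || b == Character.UnicodeBlock.HANGUL_COMPATIBILITY_JAMO) {
            return Script.HANGUL;
        }
        if (b == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS
                || b == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS_EXTENSION_A
                || b == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS_EXTENSION_B
                || b == Character.UnicodeBlock.CJK_COMPATIBILITY_IDEOGRAPHS) {
            return Script.HAN;
        }
        return Script.OTHER;
    }

    /** Guesses a language from the scripts present in the text. */
    static Language guess(String text) {
        boolean sawHan = false;
        boolean sawHangul = false;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            switch (scriptOf(cp)) {
                case HIRAGANA:
                case KATAKANA:
                    return Language.JAPANESE; // kana is used only for Japanese here
                case HANGUL:
                    sawHangul = true;
                    break;
                case HAN:
                    sawHan = true;
                    break;
                default:
                    break;
            }
            i += Character.charCount(cp);
        }
        if (sawHangul) return Language.KOREAN;
        // Han only: could be Chinese, Japanese Kanji, or Korean Hanja.
        return sawHan ? Language.HAN_AMBIGUOUS : Language.NOT_CJK;
    }
}
```

For a dictionary lookup, `CjkGuess.guess(query)` would settle the Japanese and Korean cases whenever kana or Hangul is present, while a `HAN_AMBIGUOUS` result would still need something beyond code point ranges, such as asking the user or trying each dictionary.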