Testing for Japanese/Chinese Characters in a string

Question

I have a program that reads a bunch of text and analyzes it. The text may be in any language, but I need to test for Japanese and Chinese specifically so I can analyze them in a different way.

I have read that I can test each character's Unicode code point to find out whether it falls in the range of CJK characters. This is helpful, but I would like to separate the two languages if possible so I can process the text against different dictionaries. Is there a way to test whether a character is Japanese OR Chinese?

If you don't know the code set, it may actually make your life easier rather than having everything in Unicode. — Elijah
I wind up converting everything to Unicode anyway for analysis (I'm forced to, really). I can detect the codeset before the conversion; this question is more about the case where the codeset is already Unicode. — landyman
As an addition to this question: what if you need to detect whether a character is Chinese or Japanese, and it doesn't matter which of the two it is? I am currently trying to match anything in \p{Han}\p{Hiragana}\p{Katakana}, but the following characters are not matching: 发同讲说宅电的手机告的世全所回广讲说跟 — yarian

Elijah · Accepted Answer · 2009-04-24T16:52:19

You won't be able to test a single character and tell with certainty whether it is Japanese or Chinese, because of the way the Unihan code points are implemented in the Unicode standard. Basically, every Chinese character is a potential Japanese character. However, the reverse is not true. That said, there are a number of conventions that can be used to test whether a block of text is in one language or the other.

Simplifications - if the character you are testing is a PRC simplification such as 门, the text is mainland Chinese, because such forms are only used there.
Kana - if the character is one of the many Japanese kana characters such as あいうえお, then the text block you are working with is definitely Japanese.
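
The two per-character tests above could be sketched in Python roughly as follows. This is only an illustration: the function names are made up for this example, and the set of simplified characters is a tiny hand-picked sample, not a complete list. A real check would need a full simplified/traditional mapping.

```python
# Illustrative per-character checks. The names and the sample set are
# assumptions made for this sketch, not part of any library.

# A tiny hand-picked sample of PRC simplified forms (far from complete).
PRC_SIMPLIFIED_SAMPLE = frozenset("门说讲电发")


def is_kana(ch: str) -> bool:
    """True if ch lies in the Hiragana (U+3040-U+309F) or
    Katakana (U+30A0-U+30FF) block."""
    return 0x3040 <= ord(ch) <= 0x30FF


def is_prc_simplified(ch: str) -> bool:
    """True if ch is one of the sample simplified forms above."""
    return ch in PRC_SIMPLIFIED_SAMPLE


print(is_kana("あ"), is_kana("门"))                      # True False
print(is_prc_simplified("门"), is_prc_simplified("門"))  # True False
```

Note that 門 (the traditional form, which is also the form used in Japanese) is not flagged, while 门 is.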

The problem arises from the sheer number of characters and words the two languages have in common. However, if I needed a quick and dirty solution, I would check each entire block of text for kana: if the text contains kana, then I know it is Japanese. If you need to distinguish Korean as well, I would test for Hangul. And if you need to distinguish which type of Chinese it is, testing for the different kinds of simplifications would be the best approach.
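
A minimal sketch of that quick and dirty approach, in Python. The function name and the returned labels are hypothetical; the script of each character is guessed from its Unicode character name via the standard `unicodedata` module. Text written only in kanji (no kana) will be reported as Chinese, which is exactly the ambiguity described above.

```python
import unicodedata


def guess_language(text: str) -> str:
    """Rough guess for a block of text: any kana means Japanese,
    otherwise Hangul means Korean, otherwise Han characters mean Chinese.
    The labels are illustrative only."""
    has_han = False
    has_hangul = False
    for ch in text:
        # Unnamed code points yield an empty string instead of raising.
        name = unicodedata.name(ch, "")
        if "HIRAGANA" in name or "KATAKANA" in name:
            return "japanese"  # kana is decisive, so stop early
        if "HANGUL" in name:
            has_hangul = True
        elif name.startswith("CJK UNIFIED IDEOGRAPH"):
            has_han = True
    if has_hangul:
        return "korean"
    if has_han:
        return "chinese"  # may also be Japanese written only in kanji
    return "other"


print(guess_language("これは日本語です"))  # japanese
print(guess_language("这是中文"))          # chinese
print(guess_language("한국어"))            # korean
```

Checking the whole block rather than individual characters is what makes this workable: a single Han character says nothing, but running Japanese text almost always contains some kana.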

