11
votes

I found this question which gives me the ability to check if a string contains a Chinese character. I'm not sure if the unicode ranges are correct but they seem to return false for Japanese and Korean and true for Chinese.

What it doesn't do is tell if the character is traditional or simplified Chinese. How would you go about finding this out?


update

Q: How can I recognize from the 32 bit value of a Unicode character if this is a Chinese, Korean or Japanese character?

http://unicode.org/faq/han_cjk.html

Their argument is that characters, regardless of their shape, have the same meaning and should therefore be represented by the same code. That distinction is not meaningless to me, though: I am analyzing individual characters, which doesn't work with their suggested solution:

A better solution is to look at the text as a whole: if there's a fair amount of kana, it's probably Japanese, and if there's a fair amount of hangul, it's probably Korean.
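For reference, the whole-text heuristic the FAQ describes might be sketched like this in Python. The function name, the choice of Unicode ranges and the 10% threshold are assumptions made for illustration; the FAQ does not specify any of them.

```python
def guess_cjk_language(text: str) -> str:
    """Guess the language of a CJK text from the scripts mixed into it."""
    kana = hangul = han = 0
    for ch in text:
        cp = ord(ch)
        if 0x3040 <= cp <= 0x30FF:      # Hiragana and Katakana blocks
            kana += 1
        elif 0xAC00 <= cp <= 0xD7A3:    # precomposed Hangul syllables
            hangul += 1
        elif 0x4E00 <= cp <= 0x9FFF:    # CJK Unified Ideographs (main block only)
            han += 1
    total = kana + hangul + han
    if total == 0:
        return "unknown"
    # "A fair amount" is taken to mean more than 10% here; this cutoff is
    # an arbitrary assumption.
    if kana / total > 0.1:
        return "japanese"
    if hangul / total > 0.1:
        return "korean"
    return "chinese"


print(guess_cjk_language("これは日本語の文章です"))
print(guess_cjk_language("这是中文"))
```

As expected, this says nothing about traditional versus simplified; it only separates Japanese and Korean text from Chinese.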

3
Would the codepage help distinguish? It seems that simplified Chinese is CP 936 and Traditional is CP 950, at least in the Microsoft world. Perhaps start at i18nguy.com/unicode/codepages.html for the MS and IBM codepages. - rajah9
I did a quick Google search and found this: unicode.org/faq/han_cjk.html. I found some of the questions interesting, and they discuss Traditional characters in there too. Hope it helps! - Shaded
Shaded's linked FAQ seems to answer your question exactly. As the example in the link notes, how would you determine if "chat" is English or French? If you don't think that your answer is in there, you might want to expand your question a bit. - Thanatos
It's a good link, one that I had already found. It is quite complicated. The orthography of chat/chat en/fr surely makes the two indistinguishable; however, if we used the IPA to write chat/chat [ʃæ/tʃæt], it would be possible to tell them apart through syllable construction, because the spelling would be based on sound and not on an archaic orthography. - thenengah
But Chinese is much less complicated, because 說/说 [ t/s shuo1 'to speak'] are completely different characters, one being the traditional form of "to speak" and the other the simplified form. They have different Unicode values, as opposed to a/a en/fr, which share the same character code. - thenengah
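The codepage idea from the first comment can be tried out roughly as follows. This is only a sketch: it uses Python's `gb2312` and `big5` codecs as stand-ins for the simplified and traditional code pages (CP 936 proper is the larger GBK, which also contains many traditional forms, so it would not separate the two), and the function names are made up for the example.

```python
def can_encode(ch: str, codec: str) -> bool:
    """Return True if the character exists in the given legacy encoding."""
    try:
        ch.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


def classify_by_codepage(ch: str) -> str:
    """Classify one Han character by which legacy encodings contain it."""
    in_simplified = can_encode(ch, "gb2312")   # assumption: GB2312 ~ simplified repertoire
    in_traditional = can_encode(ch, "big5")    # assumption: Big5 ~ traditional repertoire
    if in_simplified and in_traditional:
        return "both"
    if in_simplified:
        return "simplified-only"
    if in_traditional:
        return "traditional-only"
    return "neither"


for ch in "说說面":
    print(ch, classify_by_codepage(ch))
```

Characters such as 面 that appear in both repertoires come back as "both", so this cannot decide every character on its own.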

3 Answers

1
votes

As I think you've discovered, you can't. Simplified and traditional are just two styles of writing the same characters - it's like the difference between Roman and Gothic script for European languages.

5
votes

As already stated, you can't reliably detect the script style from a single character, but it is possible for a sufficiently long sample of text. See https://github.com/jpatokal/script_detector for a Ruby gem that does the job, and Simplified Chinese Unicode table for a general discussion.
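To illustrate the "sufficiently long sample" idea, here is a minimal Python sketch that counts characters known to occur in only one of the two styles. It is not how the linked gem works (I have not reproduced its logic); `SAMPLE_PAIRS` is a tiny hand-picked table and `guess_script` is a hypothetical name, both chosen for the example.

```python
# Tiny illustrative table of simplified -> traditional pairs. A real
# detector would need a far larger table than this.
SAMPLE_PAIRS = {"说": "說", "韩": "韓", "国": "國", "语": "語", "学": "學"}
SIMPLIFIED_ONLY = set(SAMPLE_PAIRS)
TRADITIONAL_ONLY = set(SAMPLE_PAIRS.values())


def guess_script(text: str) -> str:
    """Guess the script style by counting style-specific characters."""
    simp = sum(ch in SIMPLIFIED_ONLY for ch in text)
    trad = sum(ch in TRADITIONAL_ONLY for ch in text)
    if simp > trad:
        return "simplified"
    if trad > simp:
        return "traditional"
    # No style-specific characters seen, or an even split.
    return "undetermined"


print(guess_script("我说韩国语"))
print(guess_script("我說韓國語"))
print(guess_script("面"))
```

The longer the text, the more likely it is to contain at least a few style-specific characters, which is why a single character such as 面 stays undetermined.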

5
votes

It is possible for some characters. The Traditional and Simplified character sets overlap, so you have basically three sets of characters:

  1. Characters that are traditional only.
  2. Characters that are simplified only.
  3. Characters that have been left untouched, and are available in both.

Take the character 面 for instance. It belongs both to #2 and #3... As a simplified character, it stands for 面 and 麵, face and noodles. Whereas 麵 is a traditional character only. So in the Unihan database, 麵 has a kSimplifiedVariant, which points to 面. So you can deduce that it is a traditional character only.

But 面 also has a kTraditionalVariant, which points to 麵. This is where the system breaks: if you use this data to deduce that 面 is a simplified character only, you'd be wrong...

On the other hand, 韩 has a kTraditionalVariant, pointing to 韓, and these two are a "real" Simplified/Traditional pair. But nothing in the Unihan database differentiates cases like 韓/韩 from cases like 麵/面.
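The lookup described above can be sketched in Python. The sample lines below are written by hand in the tab-separated `U+XXXX<TAB>field<TAB>value` layout of the Unihan variants file and contain only the four relationships mentioned in this answer; they are an assumption, not a copy of the real data, and the function names are hypothetical.

```python
# Hand-written sample in the tab-separated layout of the Unihan variants
# file; only the relationships discussed in the answer are included.
SAMPLE_LINES = """\
U+9EB5\tkSimplifiedVariant\tU+9762
U+9762\tkTraditionalVariant\tU+9EB5
U+97D3\tkSimplifiedVariant\tU+97E9
U+97E9\tkTraditionalVariant\tU+97D3
"""


def parse_variants(data: str):
    """Build char -> variants maps for the two fields of interest."""
    simplified_of, traditional_of = {}, {}
    for line in data.splitlines():
        if not line or line.startswith("#"):
            continue
        code, field, value = line.split("\t")
        ch = chr(int(code[2:], 16))
        targets = {chr(int(v[2:], 16)) for v in value.split()}
        if field == "kSimplifiedVariant":
            simplified_of[ch] = targets
        elif field == "kTraditionalVariant":
            traditional_of[ch] = targets
    return simplified_of, traditional_of


def naive_classify(ch: str, simplified_of, traditional_of) -> str:
    """Classify using only the presence of a variant field."""
    has_simp = ch in simplified_of    # has a simplified form -> looks traditional
    has_trad = ch in traditional_of   # has a traditional form -> looks simplified
    if has_simp and not has_trad:
        return "traditional-only"
    if has_trad and not has_simp:
        return "simplified-only"
    return "unknown-or-both"


simp_map, trad_map = parse_variants(SAMPLE_LINES)
for ch in "麵面韓韩":
    print(ch, naive_classify(ch, simp_map, trad_map))
```

With this sample, 面 and 韩 both come out as "simplified-only", which is exactly the problem: the variant fields alone give the same answer for the 麵/面 case and the 韓/韩 case.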