
I have some text in Japanese, but some non-Japanese Chinese characters got mixed into it. I noticed because the Japanese font I use does not support them, so the browser renders them with a fallback font. As far as I can tell those characters are not used in Japanese, so they got there by mistake (the text comes from OCR). I used this to find kanji in the text, but it appears to match all Chinese characters, not just kanji. Is there any reliable way to detect those non-Japanese characters, e.g. by checking certain sections of Unicode?
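To illustrate the problem, here is a minimal sketch (my pattern is an assumption about what a typical range-based "kanji" regex looks like): the CJK Unified Ideographs block covers Chinese, Japanese, and Korean ideographs alike, so such a pattern matches simplified-Chinese-only characters too.

```python
import re

# A range over the CJK Unified Ideographs block (U+4E00..U+9FFF).
# It cannot distinguish Japanese kanji from other Chinese characters.
cjk = re.compile(r'[\u4e00-\u9fff]')

print(cjk.findall("漢字"))  # Japanese kanji: matched
print(cjk.findall("你们"))  # simplified-Chinese-only characters: also matched
```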

The only solution I can think of is making a complete list (or rather finding one) of kanji that are in use and checking whether each character is on the list, but I suspect that might be a little slow. Still, if I don't find a better way, I'll probably solve it like that.


1 Answer


Is there any reliable way to detect those non-Japanese characters, e.g. by checking certain sections of Unicode?

No. The CJK Unified Ideographs block is shared across languages, so you simply need to enumerate all Japanese characters yourself, for example by finding all the characters your font supports: Finding out what characters a font supports
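A hedged sketch of that approach, assuming Python and the fontTools library; the font path in the comment is a placeholder for whatever Japanese font you trust:

```python
# Sketch: build the set of code points a Japanese reference font maps to
# glyphs, then flag characters outside that set. Assumes fontTools is
# installed (pip install fonttools); the font path is a placeholder:
#
#   from fontTools.ttLib import TTFont
#   supported = set(TTFont("NotoSansJP-Regular.ttf").getBestCmap())
#
def find_unsupported(text, supported):
    """Return the characters whose code points are not in the supported set."""
    return [c for c in text if ord(c) not in supported]

# With a toy "supported" set standing in for the real font cmap:
toy = {ord(c) for c in "日本語"}
print(find_unsupported("日本語们", toy))  # ['们']
```

Loading the cmap once and reusing the resulting set keeps the per-character check cheap.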

(...)checking each character if it's on the list, but I suspect it might be a little slow.

Don't use a list; use a hashset. And if you really do want a list, sort it and use binary search. Either way it shouldn't be too slow.
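A minimal sketch of both lookups in Python; the character list here is a toy placeholder for a real list of kanji in use:

```python
import bisect

known = "日本語漢字"            # placeholder for the full list of kanji in use
kanji_set = set(known)          # hashset: O(1) average membership test
kanji_sorted = sorted(known)    # sorted list: O(log n) binary search

def in_sorted(c):
    """Binary-search membership test on the sorted list."""
    i = bisect.bisect_left(kanji_sorted, c)
    return i < len(kanji_sorted) and kanji_sorted[i] == c

text = "日本語们"
print([c for c in text if c not in kanji_set])  # set lookup -> ['们']
print([c for c in text if not in_sorted(c)])    # binary search -> ['们']
```

For a one-off OCR cleanup over a few thousand characters, either approach finishes effectively instantly; the hashset is the simpler of the two.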