3
votes

I am using PHP's Transliterator (from php5-intl, which wraps ICU) to romanize CJK text into Latin script. The problem is that I need a way to specify the input locale so that Japanese kanji are not romanized as Chinese pinyin (the two often share the same Unicode code points).

For example:

transliterator_transliterate('Any-Latin; Latin-ASCII; Lower();', $input);

中国オタク界 => zhong guo otaku jie

while I would like to get:

中国オタク界 => chuu goku otaku kai

Any idea?


Further research on the ICU site suggests that the problem may be that Han-Latin only follows pinyin transliteration, so I am looking for a way to make php5-intl tell ICU to use romaji transliteration instead (I haven't found such an ID).


3 Answers

2
votes

Yes, Han-Latin means pinyin. ICU's transliterators come from CLDR (I'll update the user guide to make this clear). ICU can already convert kana (hiragana/katakana) to Latin, but kanji have more than one reading, so you won't find what you are looking for in a simple table-based conversion.

Edit: to summarize, ICU will not do what you want without custom rules, and given how the Japanese language works, it seems unlikely to be simple even with your own rules.
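To make "writing rules" concrete, here is a minimal sketch of a rule-based transliterator built with `Transliterator::createFromRules()`. The function name `romanizeWithReadings` and the two kanji readings are assumptions made up for this one phrase; they are not something ICU or CLDR ships. It also shows why the approach does not scale: every word needs its own hand-picked reading.

```php
<?php
// Hypothetical helper: romanize text using a tiny hand-written reading table.
// The readings are assumptions for this single example. Kanji usually have
// several readings, so a real table would need a dictionary and context.
function romanizeWithReadings(string $input): string
{
    // Conversion rules run first, then the "::" transforms handle the kana.
    $rules = <<<'RULES'
中国 > 'chuugoku ';
界 > ' kai';
:: Katakana-Latin;
:: Hiragana-Latin;
:: Latin-ASCII;
:: Lower;
RULES;

    $translit = Transliterator::createFromRules($rules);
    if ($translit === null) {
        throw new RuntimeException(intl_get_error_message());
    }

    $result = $translit->transliterate($input);
    if ($result === false) {
        throw new RuntimeException($translit->getErrorMessage());
    }

    return $result;
}

echo romanizeWithReadings('中国オタク界'), "\n";
```

Any kanji that is not listed in the rules passes through unchanged, which makes the gaps in the table easy to spot.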

2
votes

This is a script I came up with to test every available transliterator chained with Latin-ASCII; Lower();, but none of them produces the result you are after. You could try some other kanji and pick a transliterator other than Any-Latin.

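// List every transliterator ID known to the installed ICU.
$scripts = transliterator_list_ids();

foreach ($scripts as $script) {
    // Chain each ID with ASCII folding and lowercasing.
    $transliterated = transliterator_transliterate(
        $script . '; Latin-ASCII; Lower();',
        '中国オタク界'
    );
    echo $transliterated . ' in ' . $script . "\n";
}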

These produced something meaningful and did not behave the same way as Any-Latin: JapaneseKana-Latin/BGN, Katakana-Latin.
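As a rough illustration of the difference, the sketch below chains only the kana transliterators, so the kanji are left untouched instead of being turned into pinyin. The function name `romanizeKanaOnly` is made up for this example, and the exact set of IDs available depends on the installed ICU version.

```php
<?php
// Hypothetical helper: romanize kana only and leave kanji as they are.
function romanizeKanaOnly(string $input): string
{
    $translit = Transliterator::create(
        'Katakana-Latin; Hiragana-Latin; Latin-ASCII; Lower();'
    );
    if ($translit === null) {
        throw new RuntimeException(intl_get_error_message());
    }

    $result = $translit->transliterate($input);
    if ($result === false) {
        throw new RuntimeException($translit->getErrorMessage());
    }

    return $result;
}

echo romanizeKanaOnly('中国オタク界'), "\n";
```

The kanji that remain in the output could then be handed to a separate tool that knows Japanese readings.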

0
votes

One possibility I could think of would be to set the locale using...

setlocale(LC_ALL, "ja_JP");

You can then use PHP's various formatting functions to get the text into the form you want before running it through the transliterator.