Latin<->Han Conversion in ICU?

Question

I am just getting started implementing ICU transforms using ICU4C in a C++ program. I am particularly looking at transliteration to and from Chinese.

According to this document, the package supports both "Han-Latin" and "Latin-Han" conversion. As a student of Chinese, I find this surprising: Latin-Han conversion is very difficult without highly advanced statistical techniques, and harder still without tone marks. (The closest I have seen is Google Transliterate, which does a great job even without user input, but it is infeasible for the present project.) I am skeptical that this is even possible without resorting to the de facto foreign-name borrowing characters, as in 比尔·莫瑞. This is the approach taken by Google Maps in their international domains, as we can see in this paper (PDF).

Anyhow, I was willing to suspend disbelief, and after consulting documentation and tutorials, I was able to construct two Transliterator objects (to and from) and perform simple transliteration using them.

While Han-Latin worked passably (about 80% accuracy for simple data), Latin-Han seemed not to work at all: it returned the same Latin string that was input. That is consistent with the results I get using the online transform sample, and with what I know about Chinese. I managed to find this table, which I think is what is used for both sources, as we can see here:

{ "Latin-Han", "file", "t_Hani_Latn", "REVERSE" },
{ "Han-Latin", "file", "t_Hani_Latn", "FORWARD" },

I would presume this meant that given a pinyin string it could potentially work to reproduce the original, but this does not seem to be the case.
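To make the doubt concrete, here is a toy sketch of why inverting a Han-to-pinyin table does not recover the original characters on its own. The four-entry table and the names `forward_table` and `invert` are made up for illustration; this is not ICU's data and not a claim about how ICU implements its REVERSE direction.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical forward table (Han -> toneless pinyin). These four entries
// are a hand-picked illustration, not data taken from ICU or CLDR.
const std::map<std::string, std::string>& forward_table() {
    static const std::map<std::string, std::string> table = {
        {"妈", "ma"},  // mā
        {"麻", "ma"},  // má
        {"马", "ma"},  // mǎ
        {"骂", "ma"},  // mà
    };
    return table;
}

// Invert a forward table: each Latin syllable collects every Han
// character that maps to it.
std::map<std::string, std::vector<std::string>> invert(
    const std::map<std::string, std::string>& forward) {
    std::map<std::string, std::vector<std::string>> reverse;
    for (const auto& entry : forward) {
        reverse[entry.second].push_back(entry.first);
    }
    return reverse;
}
```

Calling `invert(forward_table())` yields a single key, "ma", with four candidate characters, so a reverse lookup has nothing to choose between them without extra context such as tone marks or a language model.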

I guess my general question is this: is this kind of transform even possible with ICU, or anything besides Google Transliterate? What is the expected output? Relatedly, is there a listing somewhere of the script-pairs that ICU actually supports, if this is not really possible?

Thank you for your time

Steven R. Loomis · Accepted Answer · 2011-04-29T23:27:05

Note that the data is from the CLDR project, http://cldr.unicode.org. ICU supports many script pairs, and it will attempt to use a pivot script (such as Han to Latin to Russian), which is why you can create transliterators such as "Any-Latin". You might try browsing the ICU and CLDR data sets. The note at the top of the Han-Latin file says that it does not round trip.
