I am using ICU4C to transliterate CJK text. I am wondering whether ICU can also do word segmentation, i.e. split Chinese text into a sequence of words according to some word segmentation standard.
When I try transliterating for example:
直接输出html代码而不是作为函数返回值代后处理
using
UErrorCode err = U_ZERO_ERROR;
icu::Transliterator* myTrans =
    icu::Transliterator::createInstance("zh-Latin", UTRANS_FORWARD, err);
icu::UnicodeString str =
    icu::UnicodeString::fromUTF8("直接输出html代码而不是作为函数返回值代后处理");
myTrans->transliterate(str);
std::string st;
str.toUTF8String(st);
std::cout << st << std::endl;
I get the following output:
zhí jiē shū chū html dài mǎ ér bù shì zuò wèi hán shù fǎn huí zhí dài hòu chù lǐ
This looks correct when I check it against online pinyin tools, but my problem is that ICU transliterates the characters one by one. What I'm looking for is something more like the text below (I don't know any Chinese, so the grouping below is probably wrong, but it should demonstrate the kind of output I'm after):
zhíjiē shūchū html dàimǎér bùshì zuò wèihán shùfǎn huízhídài hòu chùlǐ
I have been told that ICU 50 is capable of word segmentation, but I couldn't find any documentation for it, either on the ICU web site or elsewhere on the web. Has anyone here worked with word segmentation in ICU, or do you know how to do it, or have a good link explaining how?