Convert Chinese characters into XML/HTML-style numerical entities and into Unicode UTF-8?

Question

I have a mixture of English words and Chinese characters, and I would like to convert the text into a mixture of English words and the XML/HTML-style numerical entities of the Chinese characters.

For example, the following mixture of English words, numbers and Chinese characters

Title: 目录.doc
Level: 1
PageNumber: 1
Begin
Title: 1 C语言概述
Level: 1
PageNumber: 13
Begin
Title: 1.1 Ｃ语言的发展过程
Level: 2
PageNumber: 13
Begin
Title: 1.2 当代最优秀的程序设计语言

would be turned into the following, with the Chinese characters replaced by their XML/HTML-style numerical entities:

Title: &#30446;&#24405;.doc
Level: 1
PageNumber: 1
Begin
Title: 1 C&#35821;&#35328;&#27010;&#36848;
Level: 1
PageNumber: 13
Begin
Title: 1.1 &#65315;&#35821;&#35328;&#30340;&#21457;&#23637;&#36807;&#31243;
Level: 2
PageNumber: 13
Begin
Title: 1.2 &#24403;&#20195;&#26368;&#20248;&#31168;&#30340;&#31243;&#24207;&#35774;&#35745;&#35821;&#35328;

Can I do this in Python?

Is it also possible to turn the Chinese characters into their Unicode (UTF-8) codes?

Thanks in advance!

parchment · Accepted Answer · 2014-07-09T06:54:08

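If s is a Unicode string, s.encode('ascii', 'xmlcharrefreplace') replaces every non-ASCII character with its numeric character reference.

In Python 2, where s is typically a UTF-8 encoded byte string, decode it first: s.decode('utf_8').encode('ascii', 'xmlcharrefreplace')

In Python 3, encode() returns bytes, so decode the result back to a string before printing: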

s = '''
Title: 目录.doc
Level: 1
PageNumber: 1
Begin
Title: 1 C语言概述
Level: 1
PageNumber: 13
BeginTitle: 1.1 Ｃ语言的发展过程
Level: 2
PageNumber: 13
Begin
Title: 1.2 当代最优秀的程序设计语言
'''

print(s.encode('ascii', 'xmlcharrefreplace').decode('utf_8'))

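Alternatively, you can write the conversion yourself:

res = []

# Replace every non-ASCII character (code point above 127) with its
# numeric character reference, matching what the 'ascii' codec does.
for ch in s:
    o = ord(ch)
    if o > 127:
        res.append('&#{};'.format(o))
    else:
        res.append(ch)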

res_string = ''.join(res)

print(res_string)

Both approaches print:

Title: &#30446;&#24405;.doc
Level: 1
PageNumber: 1
Begin
Title: 1 C&#35821;&#35328;&#27010;&#36848;
Level: 1
PageNumber: 13
BeginTitle: 1.1 &#65315;&#35821;&#35328;&#30340;&#21457;&#23637;&#36807;&#31243;
Level: 2
PageNumber: 13
Begin
Title: 1.2 &#24403;&#20195;&#26368;&#20248;&#31168;&#30340;&#31243;&#24207;&#35774;&#35745;&#35821;&#35328;
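
As a quick sanity check, the conversion can be reversed with the standard library. The sketch below is only an illustration for Python 3: to_char_refs is a hypothetical helper name, not something from the question, and it simply wraps the encode() call shown above. Note that html.unescape also expands named entities such as &amp;, so the round trip is only exact when the original text contains no entity-like sequences.

```python
import html


def to_char_refs(text):
    """Replace every non-ASCII character with a decimal character reference."""
    # 'xmlcharrefreplace' emits &#NNNN; for anything the ASCII codec cannot encode.
    return text.encode('ascii', 'xmlcharrefreplace').decode('ascii')


original = 'Title: 目录.doc'
escaped = to_char_refs(original)
restored = html.unescape(escaped)  # turns &#NNNN; back into characters

print(escaped)               # Title: &#30446;&#24405;.doc
print(restored == original)  # True
```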

You can get a character's Unicode code point with the ord() function:

c = '录'
code = ord(c)
print(code, hex(code))

Output:

24405 0x5f55
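
Note that ord() gives the Unicode code point, which is not the same as the UTF-8 byte sequence. If the UTF-8 bytes are what you are after, a minimal Python 3 sketch (the variable names are just for illustration) could look like this:

```python
c = '录'

code_point = ord(c)             # Unicode code point of the character
utf8_bytes = c.encode('utf-8')  # the bytes that represent it in UTF-8

print(hex(code_point))  # 0x5f55
print(utf8_bytes)       # b'\xe5\xbd\x95'

# Iterating over bytes yields integers in Python 3, so each can be formatted as hex.
print(' '.join('{:02x}'.format(b) for b in utf8_bytes))  # e5 bd 95
```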
