How to convert UTF-8 interpreted GB2312 encoding to real UTF-8 encoding?

Question

This is an unusual scenario, not a conventional conversion from one encoding to another.

Question

I use the Readability API to retrieve the main content from a given URL. It works fine if the target URL is encoded in UTF-8, but when the target URL is encoded in GB2312 (one of the Chinese encodings), I get garbage instead: the Chinese characters are wrongly encoded, while English letters and digits come through fine.

Deep Research

I inspected the HTTP headers the Readability API returns. They indicate that the API response is encoded in UTF-8.

Here's a snippet of wrongly encoded Chinese characters:

&#xC4;&#xC9;&#xB4;&#xEF;&#xB6;&#xFB;&#xBE;&#xF8;&#xBE;&#xB3;&#xCF;&#xC2;&#xB4;&#xF3;&#xB7;&#xB4;&#xBB;&#xF7;&#xBE;&#xDC;&#xBE;&#xF8;&#xC0;&#xE4;&#xC3;&#xC5;&#xC4;&#xE6;&#xD7;&#xAA;&#xBD;&#xFA;&#xBC;&#xB6;&#xD6;&#xD0;&#xCD;&#xF8;&#xCB;&#xC4;&#xC7;&#xBF;

Length: 42

The original text is:

纳达尔绝境下大反击拒绝冷门逆转晋级中网四强

Length: 21

However, if the correct Chinese text were written as Unicode numeric entities, it should be:

&#x7EB3;&#x8FBE;&#x5C14;&#x7EDD;&#x5883;&#x4E0B;&#x5927;&#x53CD;&#x51FB;&#x62D2;&#x7EDD;&#x51B7;&#x95E8;&#x9006;&#x8F6C;&#x664B;&#x7EA7;&#x4E2D;&#x7F51;&#x56DB;&#x5F3A;
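
The two snippets appear to be related byte for byte: 21 characters at two GB2312 bytes each gives 42 entities. Below is a minimal sketch for checking that, assuming the script file is saved as UTF-8 and the iconv extension is available. `bytes_as_entities` is a hypothetical helper name used only for illustration, not part of PHP or the Readability API.

```php
<?php
// Hypothetical helper: writes every byte of $bytes as a hex numeric entity.
function bytes_as_entities($bytes)
{
    $out = '';
    for ($i = 0; $i < strlen($bytes); $i++) {
        $out .= sprintf('&#x%02X;', ord($bytes[$i]));
    }
    return $out;
}

// Re-encode the first three characters of the correct text as GB2312 ...
$gb = iconv("UTF-8", "GB2312", "纳达尔");

// ... then print each raw byte as an entity and compare it by eye
// with the start of the wrongly encoded snippet.
echo strlen($gb), "\n";            // number of GB2312 bytes
echo bytes_as_entities($gb), "\n";
```

If the output matches the start of the wrongly encoded snippet, the response contains GB2312 bytes written out one entity per byte, rather than text in any encoding that iconv could convert directly.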

Tried But Not Working

iconv("GB2312", "UTF-8", $str);
iconv("GBK", "UTF-8", $str);
iconv("GB18030", "UTF-8", $str);
mb_convert_encoding($str, "UTF-8", "GB2312");
mb_convert_encoding($str, "UTF-8", "GB18030");
mb_convert_encoding($str, "UTF-8", "GBK");

Solution Requested

Since the Readability API doesn't provide a parameter for the charset of the target URL (I call the API like https://www.readability.com/api/content/v1/parser?url=http://sports.sina.com.cn/t/2013-10-04/14596813815.shtml&token=my_token_here), I have to do the conversion when handling the API response.

I would appreciate any ideas about this issue.

Environment Info: PHP 5.3.6

Joni Joni · Accepted Answer · 2013-10-05T09:01:33

It seems that the individual bytes that make up the characters have been encoded as HTML numeric entities as if they were characters from ISO-8859-1 or some other 8-bit encoding. To undo the numeric entity encoding you can use mb_decode_numericentity:

$str = "&#xC4;&#xC9;&#xB4;&#xEF;&#xB6;&#xFB;&#xBE;&#xF8;&#xBE;&#xB3;&#xCF;&#xC2;&#xB4;&#xF3;&#xB7;&#xB4;&#xBB;&#xF7;&#xBE;&#xDC;&#xBE;&#xF8;&#xC0;&#xE4;&#xC3;&#xC5;&#xC4;&#xE6;&#xD7;&#xAA;&#xBD;&#xFA;&#xBC;&#xB6;&#xD6;&#xD0;&#xCD;&#xF8;&#xCB;&#xC4;&#xC7;&#xBF;";

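// Turn each entity for code points 0-255 back into the single byte it stands for.
$str = mb_decode_numericentity($str, array(0, 255, 0, 255), "ISO-8859-1");

// The string now holds real GB2312 bytes, so an ordinary conversion works.
echo iconv("GB2312", "UTF-8", $str);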

This produces the expected output of 纳达尔绝境下大反击拒绝冷门逆转晋级中网四强.
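
The same idea can be wrapped in a function that only touches runs of byte-valued entities and leaves the rest of the response alone. This is a sketch under stated assumptions, not a definitive implementation: `decode_byte_entities` is a hypothetical name, it handles only two-digit hex entities in the range 0x80 to 0xFF (as in the sample above), and it uses a regular expression instead of mb_decode_numericentity. It avoids short array syntax so that it should also run on PHP 5.3.

```php
<?php
// Hypothetical helper: finds runs of hex entities for bytes 0x80-0xFF,
// rebuilds the raw bytes, and converts them from $sourceCharset to UTF-8.
function decode_byte_entities($html, $sourceCharset = 'GB2312')
{
    $pattern = '/(?:&#[xX][89A-Fa-f][0-9A-Fa-f];)+/';

    return preg_replace_callback($pattern, function ($m) use ($sourceCharset) {
        // Collect the two hex digits of every entity in this run.
        preg_match_all('/&#[xX]([0-9A-Fa-f]{2});/', $m[0], $parts);

        $bytes = '';
        foreach ($parts[1] as $hex) {
            $bytes .= chr(hexdec($hex));
        }

        // If the bytes are not valid in the source charset, keep the
        // original entities rather than emit broken text.
        $utf8 = @iconv($sourceCharset, 'UTF-8', $bytes);
        return $utf8 === false ? $m[0] : $utf8;
    }, $html);
}

echo decode_byte_entities('<p>&#xC4;&#xC9;&#xB4;&#xEF;&#xB6;&#xFB; 2013</p>'), "\n";
```

Plain ASCII text, markup, and other entities such as `&amp;` pass through unchanged. One limitation to keep in mind: a page that legitimately contains adjacent Latin-1 entities whose bytes happen to form valid GB2312 would be converted as well, so this is only appropriate when the source page is known to be GB2312.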
