This is a strange scenario, not a conventional conversion from one encoding to another.

Question

I use the Readability API to retrieve the main content from a given URL. It works fine if the target page is encoded in UTF-8, but when the target page is encoded in GB2312 (a Chinese encoding), I get garbage instead: the Chinese characters are wrongly encoded, while English letters and digits come through fine.

Deep Research

I inspected the HTTP headers that the Readability API returns; they indicate that the API response is encoded in UTF-8.

Here's a snippet of wrongly encoded Chinese characters:

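&#196;&#201;&#180;&#239;&#182;&#251;&#190;&#248;&#190;&#179;&#207;&#194;&#180;&#243;&#183;&#180;&#187;&#247;&#190;&#220;&#190;&#248;&#192;&#228;&#195;&#197;&#196;&#230;&#215;&#170;&#189;&#250;&#188;&#182;&#214;&#208;&#205;&#248;&#203;&#196;&#199;&#191;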

Length: 42

The original text is:

纳达尔绝境下大反击拒绝冷门逆转晋级中网四强

Length: 21

However, if you convert the correct Chinese text to Unicode, it should be:

纳达尔绝境下大反击拒绝冷门逆转晋级中网四强

Tried But Not Working

iconv("GB2312", "UTF-8", $str);
iconv("GBK", "UTF-8", $str);
iconv("GB18030", "UTF-8", $str);
mb_convert_encoding($str, "UTF-8", "GB2312");
mb_convert_encoding($str, "UTF-8", "GB18030");
mb_convert_encoding($str, "UTF-8", "GBK");
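
One possible reason these calls appear to do nothing, assuming the response carries each byte as an ASCII numeric entity such as `&#196;` rather than as a raw byte (as the answer below suggests): the string is then pure ASCII, which is valid in both the source and the target encoding, so the conversion has nothing to re-encode. A minimal sketch, where the two-entity `$sample` string is illustrative only:

```php
<?php
// Assumption: the API body holds entities like "&#196;&#201;",
// not the raw GB2312 bytes 0xC4 0xC9.
$sample = "&#196;&#201;";

// Every character in $sample is plain ASCII, so converting it
// from GB2312 to UTF-8 should hand back the same string.
$converted = iconv("GB2312", "UTF-8", $sample);

// Compare the result with the input to see whether anything changed.
var_dump($converted === $sample);
```

If that comparison holds for the real response too, the entities have to be turned back into bytes before any charset conversion can take effect.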

Solution Requested

Since the Readability API doesn't provide a parameter for the charset of the target URL (I call the API like https://www.readability.com/api/content/v1/parser?url=http://sports.sina.com.cn/t/2013-10-04/14596813815.shtml&token=my_token_here), I have to do the conversion when handling the API response.

I would very much appreciate any ideas about this issue.

Environment Info: PHP 5.3.6

1 Answer

It seems that the individual bytes that make up the characters have been encoded as HTML numeric entities as if they were characters from ISO-8859-1 or some other 8-bit encoding. To undo the numeric entity encoding you can use mb_decode_numericentity:

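// The API response carries each GB2312 byte as a decimal numeric entity.
$str = "&#196;&#201;&#180;&#239;&#182;&#251;&#190;&#248;&#190;&#179;&#207;&#194;&#180;&#243;&#183;&#180;&#187;&#247;&#190;&#220;&#190;&#248;&#192;&#228;&#195;&#197;&#196;&#230;&#215;&#170;&#189;&#250;&#188;&#182;&#214;&#208;&#205;&#248;&#203;&#196;&#199;&#191;";

// Turn entities 0-255 back into single bytes (ISO-8859-1 maps code points 0-255 to bytes one-to-one).
$str = mb_decode_numericentity($str, array(0, 255, 0, 255), "ISO-8859-1");

// Those bytes are GB2312, so convert them to UTF-8.
echo iconv("GB2312", "UTF-8", $str);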

This produces the expected output of 纳达尔绝境下大反击拒绝冷门逆转晋级中网四强.
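
The same two-step repair can be wrapped in a helper. The sketch below is one possible variant, not the code above: the function name `repairEntityEncodedGb` is hypothetical, it decodes the entities with `preg_replace_callback` and `chr` instead of `mb_decode_numericentity`, and it assumes that every decimal entity in the range 128-255 stands for one byte of GB2312 text.

```php
<?php
/**
 * Hypothetical helper: turn "&#196;&#201;"-style byte entities back into
 * raw bytes, then convert those bytes from a GB encoding to UTF-8.
 * Returns false if iconv cannot convert the recovered bytes.
 */
function repairEntityEncodedGb($text, $sourceEncoding = "GB2312")
{
    // Step 1: replace each decimal entity in 128..255 with the byte it names.
    $bytes = preg_replace_callback(
        '/&#(\d{1,3});/',
        function ($m) {
            $code = (int) $m[1];
            // Leave ASCII entities and anything above 255 untouched.
            return ($code >= 128 && $code <= 255) ? chr($code) : $m[0];
        },
        $text
    );

    // Step 2: interpret the recovered bytes as GB2312 and convert to UTF-8.
    return iconv($sourceEncoding, "UTF-8", $bytes);
}

// First two characters of the headline from the question.
echo repairEntityEncodedGb("&#196;&#201;&#180;&#239;"), "\n"; // prints 纳达
```

Restricting the callback to 128-255 keeps ASCII entities such as `&#38;` intact, which may matter if the response body is HTML. Passing "GBK" or "GB18030" as the second argument is an option if the page uses characters outside GB2312.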