0
votes

I'm using the simplexml_load_file function to parse a feed from an external source.

My code looks like this:

$rssFeed['DAILYSTAR'] = 'http://www.thedailystar.net/latest/rss/rss.xml';
$rssParser = simplexml_load_file($rssFeed['DAILYSTAR']);

The output is as follows:

Warning: simplexml_load_file() [function.simplexml-load-file]: http://www.thedailystar.net/latest/rss/rss.xml:12: parser error : Input is not proper UTF-8, indicate encoding ! Bytes: 0x92 0x73 0x20 0x48 in C:\xampp\htdocs\googlebd\index.php on line 39

The script ultimately stops with a fatal error. The main problem is that the site's character encoding is ISO-8859-1, not UTF-8.

Can I read this feed using this method (the SimpleXML API)? If not, is there another method available? I've searched through Google but found no answer. Every method I tried returns this error.

Thanks, Rashed

2 Answers

0
votes

Well, well, when I retrieve this content using Python, I get the following:

'\n<rss version="2.0" encoding="ISO-8859-1">\n [...]
<description>The results of this year\x92s Higher Secondary Certificate 

Now it says it's ISO-8859-1, but \x92 is not a printable character in that character set; in Windows-1252 it is the closing curly single quote (U+2019), used here as an apostrophe. So the feed is mis-encoded, and as per the XML spec, clients should be "strict" and not fix such errors. Note also that the encoding appears as an attribute on the rss element rather than in an XML declaration, so the parser presumably assumes UTF-8, which would explain why the error message mentions UTF-8.

You can retrieve the feed yourself and filter out the non-ISO-8859-1 characters in some fashion or, better, convert the encoding using mb_convert_encoding() before passing the result to your RSS parser.

Oh, and if you want to incorporate the result into a UTF-8 page, you may have to convert everything to UTF-8 anyway, though this feed is in English, which might not even require a different character encoding if everything turns out to be ASCII after all.
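As a rough sketch of that conversion step, assuming the feed really is Windows-1252: re-encode the raw bytes as UTF-8, strip any XML declaration that might still name the old encoding, and only then parse. The helper name loadLegacyFeed() and the inline sample string are made up for illustration; in practice $raw would come from something like file_get_contents().

```php
<?php
// Hypothetical helper (not part of SimpleXML): re-encode a legacy feed as UTF-8, then parse it.
function loadLegacyFeed(string $raw, string $sourceEncoding = 'Windows-1252')
{
    // Convert the raw bytes to UTF-8; under Windows-1252, byte 0x92 becomes U+2019.
    $utf8 = mb_convert_encoding($raw, 'UTF-8', $sourceEncoding);

    // Remove an XML declaration, if present, so it cannot contradict the new encoding.
    // Without a declaration the parser assumes UTF-8.
    $utf8 = preg_replace('/^\s*<\?xml[^>]*\?>/', '', $utf8);

    return simplexml_load_string(ltrim($utf8));
}

// Sample input standing in for the downloaded feed (assumption: same shape as the real one).
$raw = "<rss version=\"2.0\"><channel><description>this year\x92s results</description></channel></rss>";

$feed = loadLegacyFeed($raw);
echo $feed->channel->description, "\n";
```

If the source encoding is wrong for a given feed, the parse may still succeed but produce the wrong characters, so it is worth checking a few known strings in the output.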

0
votes

We ran into the same issue and used utf8_encode to change the encoding from ISO-8859-1/latin-1 to UTF-8 and get past the error.

$contents = file_get_contents($url);
$xml = simplexml_load_string(utf8_encode($contents));
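One caveat, sketched below: utf8_encode treats its input as ISO-8859-1 (and is deprecated as of PHP 8.2), so the byte 0x92 from the question would come out as the control character U+0092 rather than an apostrophe. The parse gets past the error, but the text may not display as intended. The snippet uses mb_convert_encoding with an explicit source encoding to compare the two interpretations; the sample string is invented for illustration.

```php
<?php
// Sample bytes containing 0x92, as in the feed from the question.
$raw = "year\x92s";

// Interpreted as ISO-8859-1 (what utf8_encode assumes): 0x92 maps to the control character U+0092.
$asLatin1 = mb_convert_encoding($raw, 'UTF-8', 'ISO-8859-1');

// Interpreted as Windows-1252: 0x92 maps to the right single quotation mark U+2019.
$asCp1252 = mb_convert_encoding($raw, 'UTF-8', 'Windows-1252');

echo bin2hex($asLatin1), "\n"; // 79656172c29273
echo bin2hex($asCp1252), "\n"; // 79656172e2809973
```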