2
votes

I'm working on an application that uses PHP's SimpleXML to parse RSS feeds from multiple sites. It works with both Atom and RSS feeds, but one feed only works some of the time, which doesn't make sense to me. The feed is located at http://www.popsci.com/rss.xml. I'm wondering whether the feed itself is malformed, or whether it is doing something valid but unusual. I've looked it up and down and can't find anything wrong, and my code handles many other feeds perfectly. To be clear about "sometimes": on some runs the feed is parsed successfully, but on others the call to simplexml_load_file() fails and returns false.

I do have an error log and this is what has been reported (the same thing every time it fails):

[08-Nov-2010 03:30:17] PHP Warning: simplexml_load_file() [function.simplexml-load-file]: http://www.popsci.com/rss.xml:1: parser error : Start tag expected, '<' not found in /Applications/MAMP/cmb/cron.php on line 15

[08-Nov-2010 03:30:17] PHP Warning: simplexml_load_file() [function.simplexml-load-file]: � in /Applications/MAMP/cmb/cron.php on line 15

[08-Nov-2010 03:30:17] PHP Warning: simplexml_load_file() [function.simplexml-load-file]: ^ in /Applications/MAMP/cmb/cron.php on line 15

But I'm not sure what to make of these errors, and I would appreciate it if someone could point me in the right direction. Thanks! (P.S. line 15 in cron.php is the call to simplexml_load_file())

2
The feed is good. If I pull the file down with curl and run php -r 'echo simplexml_load_file( "rss.xml" )->channel->title;' it prints the title. If I use the URL in double quotes, it gives your errors. With the URL in single quotes, it works. I'm not a PHP expert, so I don't know why this makes a difference. Maybe someone can explain. - Steven D. Majewski
Never mind: the quotes don't make a difference. It varies with time. It seems to have something to do with compression: sometimes the response is gzipped and sometimes it appears not to be. - Steven D. Majewski
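
If the intermittent failures do come from the server sometimes sending a gzip-compressed body, as the comment above suggests, one possible workaround is to fetch the body yourself, check for the gzip magic bytes, and decompress before parsing. The unprintable character in the error log would be consistent with that, but it is not proof. This is only a sketch: decode_feed_body() and load_feed() are hypothetical helper names, and it assumes PHP 7+ with the zlib extension (gzdecode() was added in PHP 5.4).

```php
<?php
// Hypothetical helper: return the feed body as plain text,
// decompressing it first if it looks like a gzip stream.
function decode_feed_body(string $body): string
{
    // Gzip streams always start with the two magic bytes 0x1f 0x8b.
    if (strncmp($body, "\x1f\x8b", 2) === 0) {
        $decoded = gzdecode($body);
        if ($decoded === false) {
            throw new RuntimeException('Body looked gzipped but could not be decoded');
        }
        return $decoded;
    }
    return $body;
}

// Hypothetical helper: fetch a feed URL and parse it.
// Returns a SimpleXMLElement, or false if the fetch or the parse fails.
function load_feed(string $url)
{
    $body = file_get_contents($url);
    if ($body === false) {
        return false;
    }
    return simplexml_load_string(decode_feed_body($body));
}
```

In cron.php the call to simplexml_load_file($url) would then become load_feed($url); feeds that arrive uncompressed pass through unchanged.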

2 Answers

1
votes

Chances are that there are unescaped characters in the feed. The usual suspects are the & character, the " character, and the < and > characters, all of which have special meaning in XML, as I'm sure you're aware.

Unfortunately, the only thing you can actually do about it is complain to the feed publisher about dodgy data in their feed, because XML parsers are explicitly forbidden from trying to recover from malformed XML. On a well-formedness error they're supposed to stop and report it.
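
To see exactly what the parser is objecting to, rather than piecing it together from PHP warnings in the log, you can ask libxml to collect its errors. A minimal sketch, assuming PHP 7.1+; parse_xml_with_errors() is a hypothetical helper name, while the libxml_* functions are the standard ones:

```php
<?php
// Hypothetical helper: parse an XML string and return
// [SimpleXMLElement|false, list of human-readable error messages].
function parse_xml_with_errors(string $xml): array
{
    // Collect libxml errors internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);
    libxml_clear_errors();

    $doc = simplexml_load_string($xml);

    $messages = [];
    foreach (libxml_get_errors() as $error) {
        $messages[] = sprintf('line %d: %s', $error->line, trim($error->message));
    }

    // Restore the previous error-handling mode and empty the error buffer.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return [$doc, $messages];
}
```

For example, [$doc, $errors] = parse_xml_with_errors('<a>AT&T</a>'); should leave $doc as false and put at least one message in $errors, because the bare & is not well-formed XML.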

1
votes

A good way to handle crummy feeds is to run them through FeedBurner.

It's a Google tool that will standardise the output for you, and it has fixed errors for me in the past.

  1. Paste the feed URL in here: http://feedburner.google.com/fb/a/myfeeds (you may need to log in to your Google account)
  2. Click Next and follow the prompts; it will give you a URL
  3. Use that URL in your script in place of the original feed URL

Hope that works for you :)

Jase