1
votes

I need to build a database of items from RSS feeds, and I'd like it to be updated asynchronously (via push, à la AJAX) rather than by pull (à la scraping RSS with Python, or Magpie in PHP). The database will be used for analysis, not for an application, so it needs to scale. If anyone knows of an RSS reader application from which you can simply export your feed items as XML, that'd be great.

I'd prefer not to build a bunch of server infrastructure just to get a PHP RSS parser to play with MySQL on cron jobs, but I will do so if necessary. I'm also interested in potential Python solutions.

I've taken a look at Superfeedr/PubSubHubbub, but I'm not sure that's the right solution for me.

2 Answers

4
votes

Please take my answer with a grain of salt, because I created Superfeedr. I will try to remain objective anyway.

If you want to scale up, and want your dataset to keep growing over time, it is likely that (as you've guessed) polling is going to be extremely hard. You will probably spend a lot of time handling HTTP issues and XML parsing issues. At Superfeedr, we are already fetching and parsing millions of feeds, and there is not a week when we don't encounter a new 'species' of error. I sometimes feel like the first settler in the Amazonian rain forest.

Among the HTTP issues, you will obviously see some services blocking you because they find your requests too aggressive, but you will also have to deal with downtime on several of these services, which can then slow your whole system down. And I'm not even talking about the ambiguities in dealing with HTTP headers (we know some servers that distinguish between Etag and etag, and some that will only accept ETags with quotes... while others will refuse them...).
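To make the header handling concrete, here is a minimal sketch of a conditional GET using only the Python standard library. The function names and the timeout value are my own choices for illustration, not part of any particular service's API. The idea is to send back whatever validators the server gave you last time, byte for byte, and to treat a 304 as "nothing new".

```python
import urllib.error
import urllib.request


def build_conditional_request(url, etag=None, last_modified=None):
    """Build a GET request that lets the server skip an unchanged feed."""
    req = urllib.request.Request(url)
    if etag:
        # Send the validator back exactly as received, quotes included.
        req.add_header("If-None-Match", etag)
    if last_modified:
        req.add_header("If-Modified-Since", last_modified)
    return req


def fetch_if_changed(url, etag=None, last_modified=None, timeout=10):
    """Return (body, etag, last_modified); body is None on a 304."""
    req = build_conditional_request(url, etag, last_modified)
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            # Response header lookup is case-insensitive, so "Etag" and
            # "etag" from the server both end up here.
            return (
                resp.read(),
                resp.headers.get("ETag"),
                resp.headers.get("Last-Modified"),
            )
    except urllib.error.HTTPError as err:
        if err.code == 304:
            # Not modified: keep the validators we already had.
            return None, etag, last_modified
        raise
```

You would store the returned ETag and Last-Modified values per feed and pass them back on the next poll. This does nothing about servers that block or throttle you; that still needs per-host rate limiting and retries.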

On the XML side, it's even worse. First, you'll have to be able to parse so much soup that you could probably feed the world (pun intended!). XML seems to be a very complex science for a lot of web developers, who forget that escaping is a necessity, that namespaces have prefixes, and that every <open> tag must eventually be </closed>. Then you'll also have to handle all the flavors: RSS, Atom, and RDF.

Once you have identified the right format, you will also have to make sense of the data. My usual example is timestamps, because people tend to mess them up a lot. Some feeds that we found even show "yesterday" as the <published> date. How cute is that? Now, for those who do publish machine-readable timestamps, you'll see some with just a numeric value, and others with 06/03/2012. Even if they use the right format (not specified in the RSS specs!), it is not rare that people don't get how timezones work (yay for stuff published in the future!) or even daylight saving time. Finally, and that's actually a legit point: some feeds do not use our Gregorian calendar, but the Arabic calendar, for example.
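As an illustration of the kind of defensive parsing this takes, here is a rough sketch in standard-library Python. It is not what Superfeedr runs; the function name and the policy choices (treating all-digit values as epoch seconds, treating zone-less values as UTC, clamping future dates to "now") are assumptions made for the example. Ambiguous values such as 06/03/2012 or "yesterday" are rejected rather than guessed, and non-Gregorian dates are not handled at all.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def parse_feed_date(raw, now=None):
    """Return an aware UTC datetime, or None if the value is unusable."""
    now = now or datetime.now(timezone.utc)
    raw = (raw or "").strip()
    if not raw:
        return None

    parsed = None
    if raw.isdigit():
        # Assumption: a bare number is seconds since the Unix epoch.
        try:
            parsed = datetime.fromtimestamp(int(raw), tz=timezone.utc)
        except (OverflowError, OSError, ValueError):
            return None
    if parsed is None:
        try:
            # RFC 822 style, e.g. "Wed, 06 Jun 2012 10:00:00 +0200".
            parsed = parsedate_to_datetime(raw)
        except (TypeError, ValueError):
            parsed = None
    if parsed is None:
        try:
            # ISO 8601 style, e.g. "2012-06-06T10:00:00Z".
            parsed = datetime.fromisoformat(raw.replace("Z", "+00:00"))
        except ValueError:
            return None

    if parsed.tzinfo is None:
        # Assumption: no zone information means UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    parsed = parsed.astimezone(timezone.utc)
    # Entries "published in the future" are clamped to the fetch time.
    return min(parsed, now)
```

When this returns None, a reasonable fallback is to record the time at which you first saw the entry.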

Diffing (identifying the new vs. the old stuff) is also incredibly hard, because timestamps are f****ed up, but also because RSS 1.0, for example, doesn't have the notion of <id>. Building one is hard, because a lot of people will put tracking codes (changing!) in links, or even in the body of their entries :(
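One way to approximate a stable identifier when the feed doesn't give you one is sketched below. This is an illustration under my own assumptions, not a description of Superfeedr's diffing: the list of tracking parameters is a guess you would need to extend, the entry is assumed to be a dict with optional "id", "link" and "title" keys, and tracking code embedded in the entry body is not addressed at all.

```python
import hashlib
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumption: these are the tracking parameters worth stripping.
TRACKING_PREFIXES = ("utm_",)
TRACKING_PARAMS = {"fbclid", "gclid"}


def canonical_link(link):
    """Drop tracking parameters and the fragment, and sort what is left."""
    parts = urlsplit(link.strip())
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if not key.lower().startswith(TRACKING_PREFIXES)
        and key.lower() not in TRACKING_PARAMS
    ]
    kept.sort()
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path, urlencode(kept), "")
    )


def entry_fingerprint(feed_url, entry):
    """Prefer the feed's own id; otherwise hash the cleaned link and title."""
    guid = (entry.get("id") or "").strip()
    if guid:
        basis = "id:" + guid
    else:
        link = canonical_link(entry.get("link") or "")
        title = (entry.get("title") or "").strip()
        basis = "link:" + link + "|title:" + title
    return hashlib.sha256((feed_url + "\n" + basis).encode("utf-8")).hexdigest()
```

Scoping the hash to the feed URL avoids collisions between feeds that reuse the same ids. An entry whose title is edited after publication will still show up as new when it has no id, which is one reason this stays hard.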

In a nutshell, polling is a mess and extremely hard to handle at scale. Now, if you go down this path, please do use the PubSubHubbub open protocol. It's a small step for you, but a huge step for webmanity, because it will grow adoption and, if all goes well, we may eventually be done with polling. The good news is that a LOT of platforms have adopted it, which should significantly decrease your polling needs.
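For a feel of what the subscriber side of PubSubHubbub involves, here is a minimal sketch of the two pieces you write yourself: the form body you POST to the hub, and the answer to the hub's verification GET on your callback URL. The function names are hypothetical, the URLs are up to you, and the `verify` argument is there on the assumption that hubs following the older 0.3 draft also expect a `hub.verify` field; check the documentation of the hub you are targeting.

```python
from urllib.parse import parse_qs, urlencode


def subscription_body(topic_url, callback_url, secret=None, verify=None):
    """Form-encoded body for a subscribe POST to the hub."""
    params = {
        "hub.mode": "subscribe",
        "hub.topic": topic_url,        # the feed you want pushed to you
        "hub.callback": callback_url,  # your publicly reachable endpoint
    }
    if secret:
        params["hub.secret"] = secret
    if verify:
        # Assumption: only needed for hubs on the older 0.3 draft.
        params["hub.verify"] = verify
    return urlencode(params).encode("ascii")


def verification_response(query_string, expected_topic):
    """Return (status, body) for the hub's verification GET."""
    params = parse_qs(query_string)
    mode = params.get("hub.mode", [""])[0]
    topic = params.get("hub.topic", [""])[0]
    challenge = params.get("hub.challenge", [""])[0]
    if mode in ("subscribe", "unsubscribe") and topic == expected_topic and challenge:
        # Echoing the challenge confirms that we asked for this.
        return 200, challenge
    return 404, ""
```

The body is sent with a Content-Type of application/x-www-form-urlencoded. Once verified, the hub POSTs new entries to the callback, so you still need a small web server, but no polling loop for the feeds that support it.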

What you're trying to build is not obvious to me, but I believe using a solution like Superfeedr is a good approach. We will deal with all the HTTP trouble, and will normalize the XML as much as we can so that it's easier for you to consume (we can even turn it into JSON). Yes, we charge for our service, but it's also a lot of time saved on your end, so that you can focus on what makes your service/datastore really different from everybody else's.

0
votes

Feeds are generally accessed through HTTP GETs, so unless you're willing to use a third-party service (such as Superfeedr), there isn't really a push option.

Having said that, the pull option isn't too bad. It depends on how much data you want to aggregate. For "read and export", most desktop feed readers will probably struggle with large volumes, but something without all the GUI overhead, like Venus, would probably get you a long way.

The variation in feed formats and quality is a big issue, but there are libraries out there which can look after this - the Python feedparser is particularly good.

It doesn't take much code to set up feed polling and push the results through a parser into a DB. (If you do set up the polling yourself, be sure to check the ETag and Last-Modified headers - or let feedparser do it for you.)
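As a rough idea of the DB half, here is a sketch that stores parsed entries and skips ones already seen. It uses SQLite from the standard library rather than the MySQL mentioned in the question, purely to keep the example self-contained; the table layout and function names are made up for illustration. Entries are assumed to be dict-like with optional "id", "link", "title" and "published" keys, which is roughly what feedparser's `parse(url).entries` gives you.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS items (
    feed_url  TEXT NOT NULL,
    entry_id  TEXT NOT NULL,
    title     TEXT,
    link      TEXT,
    published TEXT,
    PRIMARY KEY (feed_url, entry_id)
)
"""


def open_store(path=":memory:"):
    """Open (or create) the item store."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def store_entries(conn, feed_url, entries):
    """Insert entries not seen before; return how many were new."""
    before = conn.total_changes
    with conn:  # one transaction per poll
        for entry in entries:
            # Fall back to the link when the feed provides no id.
            entry_id = entry.get("id") or entry.get("link")
            if not entry_id:
                continue  # nothing to deduplicate on, so skip it
            conn.execute(
                "INSERT OR IGNORE INTO items VALUES (?, ?, ?, ?, ?)",
                (
                    feed_url,
                    entry_id,
                    entry.get("title"),
                    entry.get("link"),
                    entry.get("published"),
                ),
            )
    return conn.total_changes - before
```

A polling loop would then call something like `store_entries(conn, url, feedparser.parse(url).entries)` on a schedule, once per feed.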