
I'm using Python to build an application which functions in a similar way to an RSS aggregator. I'm using the feedparser library to do this. However, I'm struggling to get the program to correctly detect whether there is new content.

I'm mainly concerned with news-related feeds. Besides seeing if a new item has been added to the feed, I also want to be able to detect if a previous article has been updated. Does anybody know how I can use feedparser to do this, bearing in mind that the only compulsory item elements are either the title or the description? I'm willing to assume that the link element will always be present as well.

Feedparser's "id" attribute associated with each item seems to simply be the link to the article, so this may help with detecting new articles in the feed, but not with detecting updates to previous articles, since the "id" for those will not have changed.

I've looked at previous threads on Stack Overflow and some people have suggested hashing the content or hashing title+url, but I'm not really sure what that means or how one would go about it (if indeed it is the right approach).


1 Answer


Hashing in this context means calculating a short, fixed-length value that represents each combination of URL and title. This approach works when you use a hash function that makes the odds of a collision (two different inputs producing the same value) negligibly low.

Traditionally, MD5 has been a reasonable choice for this (but be careful not to use it for cryptographic purposes - it is considered broken for those).

So, for example:

>>> import hashlib
>>> url = "http://www.example.com/article/001"
>>> title = "The Article's Title"
>>> entry_id = hashlib.md5((url + title).encode("utf-8")).hexdigest()
>>> print(entry_id)
785cbba05a2929a9f76a06d834140439
>>> 

This will provide an id that will change if the URL or title changes - indicating that it is a new article.

If you also want to detect edits to the article itself, you can include the article's content (or the summary/content field that the feed provides) in the hash as well, as in the sketch below.
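Here is a minimal sketch of that idea, assuming the usual feedparser entry fields (link, title, summary); "seen" is a hypothetical in-memory store that you would persist between polls:

import hashlib

import feedparser

seen = {}  # maps article link -> last known fingerprint

def fingerprint(entry):
    # Hash the title plus whatever body text the feed provides, so that
    # an edit to either one changes the fingerprint.
    text = entry.get("title", "") + entry.get("summary", "")
    return hashlib.md5(text.encode("utf-8")).hexdigest()

def check_feed(url):
    feed = feedparser.parse(url)
    for entry in feed.entries:
        key = entry.get("link")
        digest = fingerprint(entry)
        if key not in seen:
            print("new article:", key)
        elif seen[key] != digest:
            print("updated article:", key)
        seen[key] = digest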

Note that if you do intend to pull entire pages down, you may want to look into HTTP conditional GET in Python, in order to save bandwidth and be a little friendlier to the sites you are polling.
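feedparser itself supports conditional GET via the etag and modified arguments to parse(). A rough sketch (the feed URL is a placeholder, and last_etag / last_modified are values you would save from the previous fetch):

import feedparser

FEED_URL = "http://www.example.com/feed"  # hypothetical feed URL

# Values saved from the previous fetch (None on the first run).
last_etag = None
last_modified = None

feed = feedparser.parse(FEED_URL, etag=last_etag, modified=last_modified)

if feed.get("status") == 304:
    # The server says nothing has changed, so the body was not re-sent.
    print("feed unchanged")
else:
    last_etag = feed.get("etag")
    last_modified = feed.get("modified")
    for entry in feed.entries:
        pass  # hash and compare entries as shown above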