10
votes

I've been searching around trying to find an answer both here and google, although I've found some pointers I haven't quite found a solution.

If you have a simple RSS reader with a database, you might have a couple of tables for storing feeds (ignoring dealing with subscribers here):

  • Feeds (feed-id, feed-title, feed-url)
  • Items (item-id, feed-id, item-title, item-content)

This works in most cases but for many websites/web based applications you might have a main feed from the frontpage and then category feeds, if you take both into the above sort of system there's going to be a lot of replicated data due to the same post appearing in several rss feeds.

The two choices I've come up with are either ignore it and accept the duplicates or use a link table between the feeds and the items. But this also seems like quite a waste when probably 80% of the sort of feeds I'm looking to pull won't have multiple feeds which could create this replication.

Is there a better way of doing this/am I looking at this in completely the wrong way?

Update

Thanks to both for the answers, so the consensus seems to be that the saving on space is probably not significant enough to worry about and would be negated by the potential for unknown issues (such as mentioned by dbr).

Adding a link table or similar would probably increase the processing time as well so overall not worth worrying about too much. I had thoughts after reading the responses of linking content and removing duplicates only when the post is no longer in either RSS feed to save on space but again as Assaf has said, the space savings could make this a waste of time.

2

2 Answers

4
votes

I would suggest you don't try to optimize away every possible copy of feed data at this stage of development (design, I presume). Concentrate on getting it working and when you're done, if you do some profiling and find that you can indeed save X% of storage if you use links or shared data between feeds, only then and if X is large enough to pay for the time it would take to optimize your DB would I suggest you implement any such more advanced schemes.

3
votes

As Assaf said, I wouldn't worry about storing duplicated articles if they come from different feeds, for now at least. The complication it would add doesn't benefit the few kilobytes of space you'd save..

I suppose if you take a sha1 hash of the content, do SELECT id FROM articles WHERE hash = $hash and if something exists, simply have a "article_content_id" which if set points the articles content at another row... but, what if you have two articles:

id: 1
title: My First Post!
feed: Bobs site
content: Hi!
hash: abc
link: no
content_link_id:

id:2
title: My First Post!
feed: Planet Randompeople Aggregator
content:
hash: abc
content_link_id: 1

..this works fine, and you've saved 3 bytes by not duplicating the article (obviously more if the article was longer)

..but what happens when Bob decides to add adverts to his RSS feed, changing the content from Hi! to Hi!<p><img src='...'></p> - but Planet Randompeople strips out all images. Then to update a feed item, you have to check each row that content_link_id-links against the article you are updating, check if the new item has the same hash as the articles that link against it - if it is different, you have to break the link and copy the old data to the linking-item, then copy the new content to the original item..

There's possibly neater ways to do that, but my point is that it can get very complicated, and you will probably only save a few kilobytes (assuming the database engine doesn't do any compression itself) on a very limited subset of posts..

Other than that, having a table of feeds and items seems sensible, and is how most other RSS-storing databases I've seen dealt with it..