3
votes

I am building a feed aggregator (RSS, Twitter, other services, etc.) on LAMP. It is very similar to Google Reader: people can add as many feeds as they want, then read them, sort them, and view individual feeds or groups of feeds at once.

I have built this type of service before but for a small, limited group of people where the ENTIRE group had access to all of the aggregated feed items. So, it was pretty straightforward.

This time, however, I am building a service that people can subscribe to, so I will potentially (ideally) have thousands of users, tens of thousands of feeds and, in turn, millions of feed items.

My approach for the database schema (simplified) is this:

users (id, name, ...)
feeds (id, name, url, ...)
feed_items (id, title, timestamp, feed_id, ... )
user_feeds (id, user_id, feed_id, ...)

However, since a user can subscribe to hundreds of feeds, I am trying to plan the most efficient way to query the database for the feed_items of the feeds they have subscribed to (or subsets of them).


2 Answers

2
votes

I think you're on the right track. I've done this before (a couple of times), and the important thing is to figure out which table needs to contain which information. For example, in my USERS table, I keep a cached copy of the user's subscription list (or OPML). If you're going to allow users to keep track of the read/unread status of each article, you'll probably want to keep that metadata in a separate table. I also see you've set up a join table just for user<->feed relationships. That lets you keep just one copy of each feed in the FEEDS table, but the cost in query complexity (and performance) may not be worth it. Consider which queries you expect to run.
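To make the "which queries" point concrete, here is a rough sketch of the normalized layout from the question and the join it implies for a user's combined timeline. It uses Python's built-in sqlite3 only so that it runs on its own; on MySQL the SQL would be much the same. The index names, the sample rows, and the `latest_items` helper are my own assumptions, not something from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE feeds (id INTEGER PRIMARY KEY, name TEXT, url TEXT);
CREATE TABLE feed_items (id INTEGER PRIMARY KEY, title TEXT,
                         timestamp INTEGER, feed_id INTEGER);
CREATE TABLE user_feeds (id INTEGER PRIMARY KEY, user_id INTEGER,
                         feed_id INTEGER);
-- Assumed indexes: one for the subscription lookup, one for the item scan.
CREATE UNIQUE INDEX idx_user_feeds ON user_feeds (user_id, feed_id);
CREATE INDEX idx_items_feed_time ON feed_items (feed_id, timestamp);
""")

# Made-up sample data: user 1 follows feeds 1 and 2, user 2 follows feed 3.
conn.executemany("INSERT INTO user_feeds (user_id, feed_id) VALUES (?, ?)",
                 [(1, 1), (1, 2), (2, 3)])
conn.executemany("INSERT INTO feed_items VALUES (?, ?, ?, ?)",
                 [(1, "a1", 100, 1), (2, "b1", 200, 2),
                  (3, "c1", 300, 3), (4, "a2", 400, 1)])


def latest_items(conn, user_id, limit=50):
    """Newest items across every feed the user subscribes to."""
    rows = conn.execute(
        """
        SELECT fi.id, fi.title, fi.timestamp, fi.feed_id
        FROM user_feeds uf
        JOIN feed_items fi ON fi.feed_id = uf.feed_id
        WHERE uf.user_id = ?
        ORDER BY fi.timestamp DESC
        LIMIT ?
        """,
        (user_id, limit),
    )
    return rows.fetchall()
```

This query is straightforward. The harder ones are aggregates layered on top of it, such as unread counts per folder, and those are what you should test against realistic data volumes before settling on a schema.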

For example, my users' main "home page" is a list of "folders" (i.e., Google Reader labels) into which feeds are segregated, with each folder labeled with the number of unread articles in that folder (not counting duplicate articles). Even with good indexes, that is a bear of a query (and slow) using the relational approach. If you denormalize instead, so that the FEEDS table holds one copy of each feed per subscriber and carries the user_id (and, in my case, the folder name), the table is bigger, but that query becomes simple and fast.
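Here is one possible reading of that denormalized layout, again sketched with Python's sqlite3 so it is self-contained. The column names (`folder`, `guid`, `is_read`) and the decision to store the read flag directly on the per-user post row are assumptions for illustration; adapt them to however you track read state.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- One row per (user, feed): the same URL can appear once per subscriber.
CREATE TABLE feeds (id INTEGER PRIMARY KEY, user_id INTEGER,
                    folder TEXT, url TEXT);
-- Posts hang off a per-user feed row, so read state can live here (assumed).
CREATE TABLE posts (id INTEGER PRIMARY KEY, feed_id INTEGER,
                    guid TEXT, is_read INTEGER DEFAULT 0);
CREATE INDEX idx_feeds_user ON feeds (user_id, folder);
CREATE INDEX idx_posts_feed ON posts (feed_id, is_read);
""")

conn.executemany("INSERT INTO feeds VALUES (?, ?, ?, ?)", [
    (1, 1, "news", "http://example.com/a.xml"),
    (2, 1, "news", "http://example.com/b.xml"),
    (3, 1, "tech", "http://example.com/c.xml"),
    (4, 2, "news", "http://example.com/a.xml"),  # user 2's own copy
])
conn.executemany(
    "INSERT INTO posts (feed_id, guid, is_read) VALUES (?, ?, ?)", [
        (1, "g1", 0), (2, "g1", 0),  # same article seen via two feeds
        (1, "g2", 1),                # already read
        (2, "g4", 0),
        (3, "g3", 0),
        (4, "g1", 0),
    ])


def unread_by_folder(conn, user_id):
    """Unread count per folder, counting each duplicated article once."""
    rows = conn.execute(
        """
        SELECT f.folder, COUNT(DISTINCT p.guid)
        FROM feeds f
        JOIN posts p ON p.feed_id = f.id
        WHERE f.user_id = ? AND p.is_read = 0
        GROUP BY f.folder
        ORDER BY f.folder
        """,
        (user_id,),
    )
    return dict(rows.fetchall())
```

The home page then needs a single grouped query per user, at the cost of storing each article once per subscriber.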

Also, in my POSTS table (or FEED_ITEMS -- whatever), I stow the original article description/content:encoded in the DESCRIPTION_ORIGINAL column, then put a "clean" version in the DESCRIPTION column. The clean version is HTML sanitized, ads removed, known encoding issues fixed, etc.

0
votes

Caching is incredibly useful here. Whenever a user edits their feeds, you can run the feed query and store the results in memcache.

Then you can just do a WHERE (feed_items.feed_id IN ( ... )), though I recommend you cache the results of those queries as well.
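A minimal sketch of that flow, assuming the schema from the question. To keep it runnable on its own, a plain dict stands in for memcache and sqlite3 stands in for MySQL; the function names and the cache key format are made up for the example.

```python
import sqlite3

cache = {}  # stand-in for a memcache client (assumption, keeps this runnable)

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE feed_items (id INTEGER PRIMARY KEY, title TEXT,
                         timestamp INTEGER, feed_id INTEGER);
CREATE TABLE user_feeds (id INTEGER PRIMARY KEY, user_id INTEGER,
                         feed_id INTEGER);
""")
conn.executemany("INSERT INTO user_feeds (user_id, feed_id) VALUES (?, ?)",
                 [(1, 1), (1, 2)])
conn.executemany("INSERT INTO feed_items VALUES (?, ?, ?, ?)",
                 [(1, "a1", 100, 1), (2, "b1", 200, 2), (3, "c1", 300, 3)])


def refresh_subscriptions(conn, user_id):
    """Run the feed query and cache it; call this whenever the user edits feeds."""
    rows = conn.execute(
        "SELECT feed_id FROM user_feeds WHERE user_id = ? ORDER BY feed_id",
        (user_id,),
    )
    feed_ids = [r[0] for r in rows]
    cache[f"user_feeds:{user_id}"] = feed_ids
    return feed_ids


def subscribed_feed_ids(conn, user_id):
    """Cached feed IDs, falling back to the database on a cache miss."""
    key = f"user_feeds:{user_id}"
    if key in cache:
        return cache[key]
    return refresh_subscriptions(conn, user_id)


def items_for_user(conn, user_id, limit=50):
    """Fetch items with WHERE feed_id IN (...) built from the cached IDs."""
    feed_ids = subscribed_feed_ids(conn, user_id)
    if not feed_ids:
        return []  # an empty IN () list is not valid SQL
    placeholders = ",".join("?" * len(feed_ids))
    sql = (f"SELECT id, title FROM feed_items "
           f"WHERE feed_id IN ({placeholders}) "
           f"ORDER BY timestamp DESC LIMIT ?")
    return conn.execute(sql, (*feed_ids, limit)).fetchall()
```

Note that the cached list goes stale unless every code path that edits subscriptions calls `refresh_subscriptions`. Caching the item results as well would follow the same pattern, with a key that includes the feed list and a short expiry.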