Choosing data strucutures to sort TOP 10 items out of zillion items based on users rating

Question

Lets say you're running a movie database website like IMDb/Netflix and users rate each movie from 1-10 star. When a user rate movie, I get id (long) and rating from 1-10 in the request. The Movie class looks like this.

class Movie
{
    long id;
    String name;
    double avgRating;     //Avg Rating of this movie
    long numberOfRatings; //how many times this movie was rated.
}

public void updateRating(long movieId, int rating)
{

    //code to update movie rating and update top 10 movie to show on page.
}

My question is what data structures I can choose to keep huge movies data in memory so that on each updateRating call, i update movie rating as well as update Top 10 movie and reflect on the webpage and users will always see the latest top 10 movies. I have a lot of space on web server and i can keep all the movies objects in memory. The challenges here are

1) Look up a movie by id.
2) update movie rating.
3) choose new location of this movie in the sorted collection of movies (sorted by ratings) and if its new position is in first top 10, show it on web page.

All these operations should be done in best optimal time.

this is not a homework but a general programming and data structure question.

Updating top 10 movies should not be done on every vote, but rather on a timed basis (hourly, daily, etc). — J.C. Inacio
@Jcinacio - this is constraint of program to show to most current rating of each movie. take an example of fandago dot com where users buy tickets of newly released movies based on their rating. — Imran Amjad
@CoolBeans - its not required. Also, i don't need to update database on each vote cast. i can do it periodically. — Imran Amjad

StriplingWarrior StriplingWarrior · Accepted Answer · 2011-06-09T21:18:03

I'd personally use a relational database for this.

Make a Movie table with an ID, and Name field, using the ID as the primary key (clustered)
Make a Rating table with an ID, UserId, MovieId, and Rating field. Use the obvious foreign key references.
Use an ORM to construct your Movie object based on a query across these tables.

But I suppose if you're looking at it purely from a data structures and algorithms standpoint, I'd begin by changing your Movie class to have a running ratingSum field, so that you can calculate the average on the fly. Then I'd create a list that maxes out at ten objects. Any time a rating is added, I would check to see if the new average for that movie is higher than the least of the items in the "top 10" list. If it is, then I'd insert it into the appropriate place in that list and drop the last item off the bottom of the list. Obviously, if it's already in the list then you only need to worry about reordering the existing items rather than removing one. This is a simple approach that would only have a tiny cost with each ratings update.

(A Linked List would probably give you the best performance for your "top 10" list, but with only 10 items that only get rearranged a few times a week at most, you probably wouldn't notice a difference.)

Obviously, you'll have to have all of the movies in a collection with quick lookup times (like a Hashtable) in order to find them by ID. Of course, with a zillion items, you're going to be hard pressed to fit all this into memory. Hence the Relational Database.

Choosing data strucutures to sort TOP 10 items out of zillion items based on users rating

4 Answers