4
votes

Let's say you're running a movie database website like IMDb or Netflix, and users rate each movie from 1 to 10 stars. When a user rates a movie, I get the movie's id (a long) and a rating from 1 to 10 in the request. The Movie class looks like this:

class Movie
{
    long id;
    String name;
    double avgRating;     //Avg Rating of this movie
    long numberOfRatings; //how many times this movie was rated.
}

public void updateRating(long movieId, int rating)
{

    //code to update movie rating and update top 10 movie to show on page.
}

My question is: what data structures can I use to keep a huge set of movies in memory so that each updateRating call updates the movie's rating and also updates the top 10 movies shown on the webpage, so that users always see the latest top 10? I have plenty of memory on the web server and can keep all the Movie objects in memory. The challenges here are:

1) Look up a movie by id.
2) Update the movie's rating.
3) Find the movie's new position in the collection of movies sorted by rating and, if that position is within the top 10, show it on the web page.


All of these operations should run in the best possible time.

This is not homework, just a general programming and data structures question.

4
Updating top 10 movies should not be done on every vote, but rather on a timed basis (hourly, daily, etc). - J.C. Inacio
Are you planning on serializing your objects? - CoolBeans
@Jcinacio - it is a constraint of the program to show the most current rating of each movie. Take the example of fandago dot com, where users buy tickets for newly released movies based on their ratings. - Imran Amjad
@CoolBeans - it's not required. Also, I don't need to update the database on each vote cast; I can do that periodically. - Imran Amjad

4 Answers

5
votes

I'd personally use a relational database for this.

  1. Make a Movie table with ID and Name fields, using the ID as the (clustered) primary key.
  2. Make a Rating table with ID, UserId, MovieId, and Rating fields. Use the obvious foreign key references.
  3. Use an ORM to construct your Movie object based on a query across these tables.

But I suppose if you're looking at it purely from a data structures and algorithms standpoint, I'd begin by changing your Movie class to have a running ratingSum field, so that you can calculate the average on the fly. Then I'd create a list that maxes out at ten objects. Any time a rating is added, I would check to see if the new average for that movie is higher than the least of the items in the "top 10" list. If it is, then I'd insert it into the appropriate place in that list and drop the last item off the bottom of the list. Obviously, if it's already in the list then you only need to worry about reordering the existing items rather than removing one. This is a simple approach that would only have a tiny cost with each ratings update.

(A Linked List would probably give you the best performance for your "top 10" list, but with only 10 items that only get rearranged a few times a week at most, you probably wouldn't notice a difference.)

Obviously, you'll have to have all of the movies in a collection with quick lookup times (like a Hashtable) in order to find them by ID. Of course, with a zillion items, you're going to be hard pressed to fit all this into memory. Hence the Relational Database.
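As a rough illustration of the in-memory variant described above, here is a minimal Java sketch. It assumes a running ratingSum on the movie, a HashMap for lookup by id, and a small ArrayList kept sorted best-first as the "top 10" list. The class name MovieRatings and the methods addMovie and topMovies are made up for this example and are not from the question.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class Movie {
    final long id;
    final String name;
    long ratingSum;        // running total of all ratings received
    long numberOfRatings;  // how many times this movie was rated

    Movie(long id, String name) {
        this.id = id;
        this.name = name;
    }

    // Average computed on the fly from the running sum.
    double avgRating() {
        return numberOfRatings == 0 ? 0.0 : (double) ratingSum / numberOfRatings;
    }
}

// Hypothetical holder class for this sketch.
class MovieRatings {
    static final int TOP_SIZE = 10;
    private final Map<Long, Movie> moviesById = new HashMap<>();
    private final List<Movie> top = new ArrayList<>(); // sorted, best first

    void addMovie(Movie m) {
        moviesById.put(m.id, m);
    }

    void updateRating(long movieId, int rating) {
        Movie m = moviesById.get(movieId);
        if (m == null) {
            throw new IllegalArgumentException("Unknown movie id " + movieId);
        }
        m.ratingSum += rating;
        m.numberOfRatings++;

        // If the movie is already listed, take it out so it can be re-placed.
        top.remove(m);

        // Find the first slot whose movie has a lower average than this one.
        int pos = 0;
        while (pos < top.size() && top.get(pos).avgRating() >= m.avgRating()) {
            pos++;
        }
        if (pos < TOP_SIZE) {
            top.add(pos, m);
            if (top.size() > TOP_SIZE) {
                top.remove(top.size() - 1); // drop the item that fell off the bottom
            }
        }
    }

    List<Movie> topMovies() {
        return new ArrayList<>(top);
    }
}
```

Each update costs one hash lookup plus a scan of at most ten entries. One limitation of keeping only the top ten: when a listed movie's average falls, the sketch can only reorder it within the list, because it has no record of which unlisted movie should take its place.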

3
votes

It seems like there are two parallel structures here. First, you need a lookup table that can map from IDs to movies. Second, you need to maintain some sort of priority queue that can be used to track the top ten movies overall.

One way to solve this problem would be to simply maintain these two structures concurrently. Since you know that each movie has an integral ID, you could either store the movies in a giant array or, if you expect the IDs to be sparse, in a hash table. Additionally, you could maintain a priority queue (perhaps backed by a binary or binomial heap) that stores all movies with priority equal to their rating. This would allow you to determine the top ten movies by dequeuing ten elements from the priority queue and then reinserting them.
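A minimal sketch of this simple version, assuming java.util.PriorityQueue (a binary heap) for the queue and a HashMap for the lookup table. The class name RankedMovies and its methods are invented for the example. java.util.PriorityQueue has no change-key operation, so the sketch removes and re-adds the movie instead, and that removal takes linear time in this particular class.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

class Movie {
    final long id;
    double avgRating;
    long numberOfRatings;

    Movie(long id) {
        this.id = id;
    }
}

// Hypothetical class for this sketch: id lookup plus a max-heap of all movies.
class RankedMovies {
    private final Map<Long, Movie> byId = new HashMap<>();
    // Reversed comparator turns Java's min-heap into a max-heap on avgRating.
    private final PriorityQueue<Movie> heap = new PriorityQueue<>(
            Comparator.comparingDouble((Movie m) -> m.avgRating).reversed());

    void addMovie(Movie m) {
        byId.put(m.id, m);
        heap.add(m);
    }

    void updateRating(long movieId, int rating) {
        Movie m = byId.get(movieId);
        if (m == null) {
            throw new IllegalArgumentException("Unknown movie id " + movieId);
        }
        heap.remove(m); // take it out before its key changes
        m.avgRating = (m.avgRating * m.numberOfRatings + rating) / (m.numberOfRatings + 1);
        m.numberOfRatings++;
        heap.add(m);    // re-insert with the new key
    }

    // Dequeue up to k movies, then put them back so the heap stays complete.
    List<Movie> top(int k) {
        List<Movie> result = new ArrayList<>();
        while (result.size() < k && !heap.isEmpty()) {
            result.add(heap.poll());
        }
        heap.addAll(result);
        return result;
    }
}
```

Calling top(10) costs ten dequeues and ten inserts, each logarithmic in the number of movies.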

However, to squeeze more performance out of your priority queue, I'd suggest using a slightly modified queue structure in which you have an array of the top ten movies in sorted order and a priority queue of all other movies that are not in the top ten. Whenever you update the priority of a movie, you could do the following:

  1. If the movie is in the top-ten array, remove it from that array and shuffle the elements after it up one spot. Then insert it into the priority queue with its new rating.

  2. Otherwise, use the priority queue's change-key function (decrease-key or increase-key, depending on how the heap is ordered and which way the rating moved) to update its key. If the rating is now higher than that of the tenth-most popular movie in the top ten list, remove that movie from the top ten list and insert it into the priority queue. Otherwise, we are done.

  3. (At this point, the element is now in the priority queue at its proper location, and the top ten movies array has nine elements in it)

  4. Use the priority queue's dequeue-max function to extract the most popular movie from the priority queue, then use a simple insertion sort to insert it into the array of the top ten most popular movies.

The overall time complexity for this approach (assuming you use a binary or binomial heap) is O(k^2 + lg n), where k is the number of elements in the top-ten list and n is the total number of movies. On average, it runs in O(lg n) time, since chances are you don't need to update the top ten list. In either case, since k is small (ten), I'd assume that this would work very quickly. Moreover, it gives you O(1) lookup for any of the top k movies, which I expect will be a pretty common operation.
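The steps above could be sketched in Java roughly as follows. Since java.util.PriorityQueue offers no change-key operation, this sketch swaps in a TreeSet (a red-black tree) for the queue of non-top-ten movies: remove followed by add stands in for the key update, and pollFirst stands in for dequeue-max, both in O(lg n). The names TwoTierRanking, place, and topMovies are hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

class Movie {
    final long id;
    long ratingSum;
    long numberOfRatings;

    Movie(long id) {
        this.id = id;
    }

    double avg() {
        return numberOfRatings == 0 ? 0.0 : (double) ratingSum / numberOfRatings;
    }
}

// Hypothetical class: sorted top-K array plus an ordered set of everything else.
class TwoTierRanking {
    private static final int K = 10;
    // Best first; ties broken by id so two distinct movies never compare equal.
    private static final Comparator<Movie> BEST_FIRST =
            Comparator.<Movie>comparingDouble(Movie::avg).reversed()
                      .thenComparingLong(m -> m.id);

    private final Map<Long, Movie> byId = new HashMap<>();
    private final List<Movie> top = new ArrayList<>();              // at most K, best first
    private final TreeSet<Movie> rest = new TreeSet<>(BEST_FIRST);  // all other movies

    void addMovie(Movie m) {
        byId.put(m.id, m);
        place(m);
    }

    void updateRating(long movieId, int rating) {
        Movie m = byId.get(movieId);
        if (m == null) {
            throw new IllegalArgumentException("Unknown movie id " + movieId);
        }
        // Steps 1 and 2: remove the movie from whichever structure holds it
        // BEFORE changing its key, so the ordered set stays consistent.
        if (!top.remove(m)) {
            rest.remove(m);
        }
        m.ratingSum += rating;
        m.numberOfRatings++;
        place(m);
    }

    private void place(Movie m) {
        rest.add(m);
        // Step 2: if m now beats the weakest top entry, demote that entry.
        if (top.size() == K && BEST_FIRST.compare(m, top.get(K - 1)) < 0) {
            rest.add(top.remove(K - 1));
        }
        // Step 4: refill the top array from the best of the rest.
        while (top.size() < K && !rest.isEmpty()) {
            Movie best = rest.pollFirst();
            int pos = 0;
            while (pos < top.size() && BEST_FIRST.compare(top.get(pos), best) < 0) {
                pos++;
            }
            top.add(pos, best);
        }
    }

    List<Movie> topMovies() {
        return new ArrayList<>(top);
    }
}
```

When an update does not touch the top ten, the work is one removal and one insertion in the tree. The linear scans over the array only run when the array actually changes.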

Hope this helps!

1
votes

If you need to access the entire data set in sorted order, I would suggest using a sorted tree and comparing your items by rating.

If, however, you only need to view the top ten, then you could use a sorted deque, and every time you update an item's rating, add it to the deque and immediately trim it to no more than 10 items (unless you use a bounded implementation, in which case that is done for you).
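A small sketch of the sorted-tree variant, assuming Java's TreeSet as the tree and a comparator on the average rating with the id as a tie-breaker (a TreeSet treats elements that compare equal as duplicates). The class name SortedMovies and its methods are made up for the example.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

class Movie {
    final long id;
    long ratingSum;
    long numberOfRatings;

    Movie(long id) {
        this.id = id;
    }

    double avg() {
        return numberOfRatings == 0 ? 0.0 : (double) ratingSum / numberOfRatings;
    }
}

// Hypothetical class: every movie lives in one tree ordered by rating.
class SortedMovies {
    private final Map<Long, Movie> byId = new HashMap<>();
    private final TreeSet<Movie> byRating = new TreeSet<>(
            Comparator.<Movie>comparingDouble(Movie::avg).reversed()
                      .thenComparingLong(m -> m.id));

    void addMovie(Movie m) {
        byId.put(m.id, m);
        byRating.add(m);
    }

    void updateRating(long movieId, int rating) {
        Movie m = byId.get(movieId);
        if (m == null) {
            throw new IllegalArgumentException("Unknown movie id " + movieId);
        }
        byRating.remove(m);   // must happen before the key changes
        m.ratingSum += rating;
        m.numberOfRatings++;
        byRating.add(m);      // re-insert at its new position
    }

    // Walk the first k entries of the tree; nothing is removed.
    List<Movie> top(int k) {
        List<Movie> result = new ArrayList<>();
        Iterator<Movie> it = byRating.iterator();
        while (result.size() < k && it.hasNext()) {
            result.add(it.next());
        }
        return result;
    }
}
```

A bounded top-ten variant could follow the same pattern and call pollLast() whenever the set grows past ten entries.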

0
votes

To populate the top 10 list initially you'll have to make a pass over all the data. However, after that you could keep the rating of the #10 movie and, each time a vote is cast, update the top 10 only if the updated movie's rating is greater than or equal to the rating of #10. Anything less than that average rating would not affect the top 10.
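One possible way to sketch both parts in Java: the initial pass keeps a min-heap of at most k movies, so the lowest-rated candidate is the one evicted, and the per-vote check compares the updated average against the last entry of the list. The names TopTenSeed, initialTop, and mayAffectTop are invented for this example, and Movie is reduced to an id and an average.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

class Movie {
    final long id;
    final double avgRating;

    Movie(long id, double avgRating) {
        this.id = id;
        this.avgRating = avgRating;
    }
}

// Hypothetical helper class for this sketch.
class TopTenSeed {
    // One pass over all movies, keeping only the k best seen so far.
    static List<Movie> initialTop(Collection<Movie> all, int k) {
        PriorityQueue<Movie> worstFirst = new PriorityQueue<>(
                Comparator.comparingDouble((Movie m) -> m.avgRating));
        for (Movie m : all) {
            worstFirst.add(m);
            if (worstFirst.size() > k) {
                worstFirst.poll(); // evict the current lowest-rated candidate
            }
        }
        List<Movie> top = new ArrayList<>(worstFirst);
        top.sort(Comparator.comparingDouble((Movie m) -> m.avgRating).reversed());
        return top; // best first
    }

    // True when a movie with this updated average could enter the top list.
    static boolean mayAffectTop(double updatedAvg, List<Movie> top, int k) {
        return top.size() < k || updatedAvg >= top.get(top.size() - 1).avgRating;
    }
}
```

The initial pass costs O(n lg k) for n movies. The check only covers movies trying to enter the list: if the updated movie is already in the top 10 and its average drops, the list still needs to be revisited.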

Also, I'd store the data in a relational database as has already been suggested, and keep only the top 10 in memory.