1
votes

I would like to implement a real time content ranking system on cassandra 1.2 CQL 3.0

Example : Providing the top 10 comments on a news based on user vote

  • user can comment a news
  • user can vote +1 or -1 on each comment

Maybe 100, 1000, 10k, 100k comments by content

So what i need is to get real time (or 5min delay) the top 10 comments based on vote score

[real example youtube comment]

My solution is :

  • storing vote counter on counter table (content_counter)
  • create a rank table

create table content_comment_rank ( content_id timeuuid, score bigint, com_id timeuuid, PRIMARY KEY (content_id, score, com_id) ) WITH CLUSTERING ORDER BY (score DESC)

  • all data of a content_id is stored on 1 ROW ordered by score desc
  • I only have i this table comment with at least 1 vote

So what happen when a user vote

  • SELECT score FROM content_counter where content_id='x' (R: 123)
  • UPDATE content_counter SET score=score+1 (124) (or -1)

IN BATCH :

  • DELETE content_comment_rank WHERE content_id='x' AND score=123 AND com_id='y'
  • INSERT content_comment_rank (content_id,score,com_id) VALUES ('x',124,'y')

So having top 10 comments of a content is

SELECT * FROM content_comment_rank WHERE content_id='x' LIMIT 10

It's pretty easy BUT i'm not sure i'm doing well because :

  • How to deal with replication and Concurrent vote

    Maybe by updating 1 time in batch each day (It's like that vote on video youtube is working)

  • Maybe there is a better way....

We need "cluster ordering on counter table"..... (all done in 1 atom query) but doesn't existe (or not...)

What's your advice on my solution ?

Thank you,

1

1 Answers

1
votes

@Sebastien Kondov

You are duplicating counter value (score) in content_comment_rank table. You may have mismatches between the Cassandra distributed counter table content_counter and the counter value in table content_comment_rank

To mitigate that you may want to write and read in this table with QUORUM consistency level but it has a cost on performance, for sure.

Now, it depends on how critical it is for you to have mismatches in counting in real time. If it is acceptable to have some % error rate then your solution is perfectly suitable. Afterward, in the background, you may run batches to re-synchronize content_counter and content_comment_rank.

How to deal with replication and Concurrent vote

--> On concurrent vote, content_counter table will provides strong consistency. For content_comment_rank, the last write wins

Maybe there is a better way....

You can read this article: http://nathanmarz.com/blog/how-to-beat-the-cap-theorem.html

The idea is:

  1. Have a real-time process to save votes in content_comment_rank table, and accept some errors in the counting.
  2. Have a batch in background that fixes any mismatch in counter, using content_counter table as source of truth to write data to another correct_content_comment_rank table. This batch process should be run frequently, but reading and fixing data only for the last past few hours
  3. For queries on top 10 comments, you can merge results between content_comment_rank and correct_content_comment_rank