3
votes

We are using Datastax Cassandra for our social network and we are designing/data modeling tables we need, it is confusing for us and we don't know how to design some tables and we have some little problems!

As we understood for every query we have to have different tables, and for example user A is following user C and B.

Now, in Cassandra we have a table that is posts_by_user:

user_id      |  post_id       |  text  |  created_on  |  deleted  |  view_count  

likes_count  |  comments_count  |  user_full_name

And we have a table according to the followers of users, we insert the post's info to the table called user_timeline that when the follower users are visiting the first web page we get the post from database from user_timeline table.

And here is user_timeline table:

follower_id      |      post_id      | user_id (who posted)  |  likes_count  |  

comments_count   |   location_name   |  user_full_name

First, Is this data modeling correct for follow base (follower, following actions) social network?

And now we want to count likes of a post, as you see we have number of likes in both tables (user_timeline, posts_by_user), and imagine one user has 1000 followers then by each like action we have to update all 1000 rows in user_timeline and 1 row in posts_by_users; And this is not logical!

Then, my second question is How should it be? I mean how should like (favorite) table be?

1

1 Answers

5
votes

Think of using posts_by_user as metadata for a post's information. This would allow you to house user_id, post_id, message_text, etc, but you would abstract the view_count, likes_count, and comments_count into a counter table. This would allow you to fetch either a post's metadata or counters as long as you had the post_id, but you would only have to update the counter_record once.

DSE Counter Documentation: https://docs.datastax.com/en/cql/3.1/cql/cql_using/use_counter_t.html

However,

The article below is a really good starting point in relation to data modeling for Cassandra. Namely, there are a few things to take into consideration when answering this question, many of which will depend on the internals of your system and how your queries are structured. The first two rules are stated as:

Rule 1: Spread Data Evenly Around the Cluster

Rule 2: Minimize the Number of Partitions Read

Taking a moment to consider the "user_timeline" table.

  1. user_id and created_on as a COMPOUND KEY* - This would be ideal if

    • You wanted to query for posts by a certain user and with the assumption that you would have a decent number of users. This would distribute records evenly, and your queries would only be hitting a partition at a time.
  2. user_id and a hash_prefix as a COMPOUND KEY* - This would be ideal if

    • You had a small number of users with a large number of posts, which would allow your data to be evenly spread across the cluster. However you run the risk of having to query across multiple partitions.
  3. follower_id and created_on as a COMPOUND KEY* - This would be ideal if

    • You wanted to query for posts being followed by a certain follower. The records would be distributed and you would minimize queries across partitions

These were 3 examples for 1 table, and the point I wanted to convey is to design your tables around the queries you want to execute. Also don't be afraid to duplicate your data across multiple tables that are setup to handle various queries, this is the way Cassandra was meant to be modeled. Take a bit to read the article below and watch the DataStax Academy Data Modeling Course, to familiarize yourself with the nuances. I also included an example schema below to cover the basic counter schema I was pointing out earlier.

* The reason for the compound key is due to the fact that your PRIMARY KEY has to be unique, otherwise an INSERT with an existing PRIMARY KEY will become an UPDATE.

http://www.datastax.com/dev/blog/basic-rules-of-cassandra-data-modeling https://academy.datastax.com/courses

CREATE TABLE IF NOT EXISTS social_media.posts_by_user (
user_id uuid,
post_id uuid,
message_text text,
created_on timestamp,
deleted boolean,
user_full_name text,
PRIMARY KEY ((user_id, created_on))
);
CREATE TABLE IF NOT EXISTS social_media.user_timeline (
follower_id uuid,
post_id uuid,
user_id uuid,
location_name text,
user_full_name text,
created_on timestamp,
PRIMARY KEY ((user_id, created_on))
);
CREATE TABLE IF NOT EXISTS social_media.post_counts (
likes_count counter,
view_count counter,
comments_count counter,
post_id uuid,
PRIMARY KEY (post_id)
);