Summary
Write conflicts are something you shouldn't need to worry about here. All INSERTs, UPDATEs, and DELETEs in Cassandra are upserts, and writes are resolved per column rather than per row.
Cassandra uses a last-write-wins strategy to resolve conflicts. As you can see in the example below, whenever you change a value, the timestamp associated with that column is updated. Because your concurrent updates touch different columns (one thread updates col_a and the other updates col_b), they do not overwrite each other.
Example
Initial Insert
cqlsh:test_keyspace> insert into race_condition_test (pk, col_a, col_b ) VALUES ( '1', 'Deckard', 'Blade Runner');
cqlsh:test_keyspace> select * from race_condition_test ;
pk | col_a | col_b
----+---------+--------------
 1 | Deckard | Blade Runner
(1 rows)
Timestamps are the same in the initial insert (writetime() returns the timestamp of a column's last write, in microseconds since the Unix epoch).
cqlsh:test_keyspace> select pk, col_a, writetime(col_a), col_b, writetime(col_b) from race_condition_test ;
pk | col_a | writetime(col_a) | col_b | writetime(col_b)
----+---------+------------------+--------------+------------------
1 | Deckard | 1526916970412357 | Blade Runner | 1526916970412357
(1 rows)
Once col_b is updated, its timestamp changes to reflect the change.
cqlsh:test_keyspace> insert into race_condition_test (pk, col_b ) VALUES ( '1', 'Rick');
cqlsh:test_keyspace> select pk, col_a, writetime(col_a), col_b, writetime(col_b) from race_condition_test ;
pk | col_a | writetime(col_a) | col_b | writetime(col_b)
----+---------+------------------+-------+------------------
1 | Deckard | 1526916970412357 | Rick | 1526917272641682
(1 rows)
After col_a is updated, it too gets its timestamp updated to the new value.
cqlsh:test_keyspace> insert into race_condition_test (pk, col_a) VALUES ( '1', 'bounty hunter');
cqlsh:test_keyspace> select pk, col_a, writetime(col_a), col_b, writetime(col_b) from race_condition_test ;
pk | col_a | writetime(col_a) | col_b | writetime(col_b)
----+---------------+------------------+-------+------------------
1 | bounty hunter | 1526917323082217 | Rick | 1526917272641682
(1 rows)
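For completeness, UPDATE behaves exactly the same way, because every write is an upsert. This statement is equivalent to the last INSERT above and would likewise bump only writetime(col_a):

cqlsh:test_keyspace> update race_condition_test set col_a = 'bounty hunter' where pk = '1';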
Recommendation
My recommendation is to use a single table that serves your query needs. If you need to query by pk, create one table with all the columns you need. That way you have a single wide row that can be read back efficiently as part of a single query.
The data model you describe in option 2 is a bit too relational and is not optimal for Cassandra. You cannot perform joins natively in Cassandra, and you should avoid performing joins on the client side.
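As a rough sketch of that single-table approach (my_entity and the col_* names are placeholders, since I don't know your actual schema):

-- One table that serves the "look everything up by pk" query.
-- Table and column names are illustrative; substitute your own.
CREATE TABLE test_keyspace.my_entity (
    pk    text,
    col_a text,
    col_b text,
    col_c text,
    PRIMARY KEY (pk)
);

-- The whole row comes back in one single-partition read:
SELECT * FROM test_keyspace.my_entity WHERE pk = '1';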
Data Model Rules:
Rule 1: Spread data evenly across the cluster
Choose a partition key that distributes data evenly across the cluster, so that no node becomes a hotspot.
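For example (made-up tables): a high-cardinality partition key such as a user id spreads rows evenly, while a low-cardinality key such as a country code funnels everything into a handful of partitions:

-- Good: user_id has high cardinality, so rows hash evenly
-- across the cluster.
CREATE TABLE users_by_id (
    user_id uuid,
    name    text,
    country text,
    PRIMARY KEY (user_id)
);

-- Risky: only a few distinct countries exist, so a few large
-- partitions (and the nodes that own them) become hotspots.
CREATE TABLE users_by_country (
    country text,
    user_id uuid,
    name    text,
    PRIMARY KEY (country, user_id)
);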
Rule 2: Minimize the number of partitions read
Each partition may reside on a different node, so for performance's sake you should aim for queries that can ideally be answered from a single partition, and therefore a single node.
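For instance (again hypothetical), clustering all of a user's events under one partition key means the query below touches a single partition:

-- All events for one user live in one partition, newest first.
CREATE TABLE events_by_user (
    user_id  uuid,
    event_ts timestamp,
    payload  text,
    PRIMARY KEY (user_id, event_ts)
) WITH CLUSTERING ORDER BY (event_ts DESC);

-- Single-partition read, served by one replica set.
SELECT * FROM events_by_user
WHERE user_id = 123e4567-e89b-12d3-a456-426614174000;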
Rule 3: Model around your queries
- Determine what queries to support
- Create a table that satisfies your query (meaning that you should use one table per query pattern).
- If you need to support more query patterns, denormalize your data into additional tables that serve those queries (see the sketch after this list). Avoid secondary indexes and materialized views, as they are not stable at the moment, and secondary indexes in particular can create major performance issues as your cluster grows.
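As a sketch of "one table per query pattern" (tables and columns made up for illustration):

-- Query pattern 1: fetch a user by id.
CREATE TABLE users_by_id (
    user_id uuid,
    email   text,
    name    text,
    PRIMARY KEY (user_id)
);

-- Query pattern 2: fetch a user by email. The same data is
-- denormalized into a second table instead of relying on a
-- secondary index; the application writes to both tables.
CREATE TABLE users_by_email (
    email   text,
    user_id uuid,
    name    text,
    PRIMARY KEY (email)
);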
If you want to read a little more about this, I suggest this DataStax page:
Basic Rules of Cassandra Data Modeling