8 votes

I just read the DataStax post "Basic Rules of Cassandra Data Modeling" and, to sum up, we should model our database schema around our queries, not around our relations/objects. As a consequence, many tables can hold the same duplicated data, for example users_by_email and users_by_username, which both contain the same user records.

How can I handle updating such an object?
For example, when the user edits his email, do I UPDATE both tables manually, or do I just INSERT the object with all its columns and not care about the previous data (which would still be in my database, but with a stale email value)?

In the UPDATE case, how can I handle data synchronization?
Currently I'm doing it manually, but is there a tool to help me? Potentially I could have 5 or 6 tables with different partition/clustering keys.
I heard that Hadoop, or Apache Spark, can do it.


3 Answers

3 votes

In Cassandra, writes are upserts: an INSERT or UPDATE using the same primary key as an existing record simply shadows the old values, and the write with the newest timestamp wins (the superseded data is cleaned up later, during compaction). There are a few subtleties in the difference between INSERT and UPDATE, such as counters and null values, but those are probably not relevant to this question.
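
For instance, using the users_by_username table from the linked post (a sketch; the literal values below are just placeholders), both statements target the same row, and whichever write carries the newest timestamp determines the final state:

-- Upsert semantics: this INSERT creates the row keyed by 'fromanator'...
INSERT INTO users_by_username (username, email, age)
VALUES ('fromanator', 'old-address@example.invalid', 24);

-- ...and this later UPDATE simply shadows the old email value.
UPDATE users_by_username
SET email = 'new-address@example.invalid'
WHERE username = 'fromanator';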

Up to Cassandra 3.0, the responsibility for keeping several views of the same data in sync lies with the client application. And yes, that means inserting/updating the new data in every table that requires it.
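
Concretely, with the two tables from the linked post, that means issuing one write per table (a sketch; the literal values are placeholders). If either statement fails, the application itself has to retry, or the tables drift apart:

-- The client must apply the same change to every table that duplicates it.
UPDATE users_by_email SET age = 25 WHERE email = 'user@example.invalid';
UPDATE users_by_username SET age = 25 WHERE username = 'fromanator';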

Cassandra 3.0 introduced "Materialized Views", which let you maintain a "master" table of the data and several views over it, all managed by Cassandra. This requires careful data modelling, so that the primary key of the 'master' table contains all the columns needed to define the different views and serve their queries.
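
For example, a base table and one view over it could look like the following (a sketch; the users schema is an assumption based on the linked post):

-- "Master" table: the application writes here only.
CREATE TABLE users (
  username text PRIMARY KEY,
  email text,
  age int
);

-- Cassandra keeps the view in sync with the base table on every write.
-- Every base-table primary key column must appear in the view's key,
-- and all view key columns must be restricted with IS NOT NULL.
CREATE MATERIALIZED VIEW users_by_email AS
  SELECT email, username, age
  FROM users
  WHERE email IS NOT NULL AND username IS NOT NULL
  PRIMARY KEY (email, username);

After that, updating a user's email in users is enough; Cassandra propagates the change to users_by_email for you.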

One additional note: if you find that your data is highly relational and requires several/many views to make it queryable, maybe Cassandra is not a good fit for the problem space, and you should probably consider an RDBMS instead.

To extend the example provided: user information is probably something we would like to keep in a relational DB, while high-volume actions of those users (purchases, clicks, heart-rate samples, ...) could be recorded in Cassandra.

4 votes

To ensure data consistency across your many tables containing the same data, but laid out differently, it's recommended that you use a LOGGED BATCH in CQL to do the update. This way the write statements in your BATCH are atomic: either all of them are eventually applied or none are, so you don't have to worry about some of them failing and having to retry. (Note that a logged batch gives you atomicity, not full ACID semantics; in particular there is no isolation, so a reader may briefly see one table updated before the other.)

Using the linked article's schema, it would look like:

BEGIN BATCH
  INSERT INTO users_by_email (email, username, age) VALUES ('[email protected]', 'fromanator', 24);
  INSERT INTO users_by_username (email, username, age) VALUES ('[email protected]', 'fromanator', 24);
APPLY BATCH;

This whole batch is atomic: if one insert fails, they all fail and no change is made.

1 vote

What I've done in my system is have a unique identifier for each user.

I use one table mapping email to identifier (plus some other data). When a user logs in or uses the system, I use his email to find the identifier, then everything else uses that identifier.

The user can now change his email address; the identifier stays the same, so none of the other tables require an UPDATE for such a change.
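
A minimal sketch of this layout (the table and column names are assumptions, not from the linked post):

-- Lookup table: resolves a login email to the stable identifier.
CREATE TABLE users_by_email (
  email text PRIMARY KEY,
  user_id uuid
);

-- All other data hangs off the identifier, which never changes.
CREATE TABLE users (
  user_id uuid PRIMARY KEY,
  email text,
  username text,
  age int
);

Changing an email then only touches the lookup table (insert the new row, delete or expire the old one) plus the email column in users.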

In regard to the old email address, I have not implemented it all yet, but I plan to have the current email reference the old one (a "link", if you like) and, after a certain amount of time, maybe 12 months, the old email will get deleted. For those 12 months, the old address stays blocked (no one else can reuse it for an account). This is a good idea for various security reasons.
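
If you let Cassandra handle the expiry, a row TTL is one way to get that 12-month window (a sketch; the address and uuid below are placeholder values):

-- Keep the retired address pointing at the account for ~12 months,
-- then Cassandra expires the row automatically.
INSERT INTO users_by_email (email, user_id)
VALUES ('old-address@example.invalid', 62c36092-82a1-3a00-93d1-46196ee77204)
USING TTL 31536000;  -- 365 days, in seconds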

P.S. For a unique identifier, people use different solutions, such as ZooKeeper; I personally liked using Cassandra with Lamport's bakery algorithm.