Cassandra table synchronization

Question

I just read the DataStax post "Basic Rules of Cassandra Data Modeling" and, to sum up, we should modeling our database schema by our queries and not by our relations/objects. So, many tables can have the same duplicated data, for example users_by_email and users_by_username which both have the same data.

How can I handle the object update ?
For example the user edit his email, do I UPDATE both tables manually or only INSERT the object with all columns and don't care about previous data (which are still in my database, but with a wrong column value => email).

In case of UPDATE, how can I handle data synchronization ?
Currently, I'm doing it manually but is there a tool to help me ? Because, possibly, I can have 5 or 6 tables with different partition/clustering keys.
I heard that Hadoop can do it, or Apache Spark.

maasg maasg · Accepted Answer · 2016-03-05T18:38:25

In Cassadnra, given an existing record, an update or insert using the same primary key will result in the old record marked for deletion (with a tombstone) and the new record becomes "live". There're few subtleties in the difference between Insert and Update, like counters and null values, but those are probably not relevant for the question.

Up to Cassandra 3.0, the responsibility of maintaining several views of the same data in sync is in hands of the client application. And yes, it means to insert/update the new data in all the different tables that require it.

Cassandra 3.0 introduced "Materialized Views", which lets you maintain a "master" table of the data and several views on it, all managed by Cassandra. It requires careful data modelling so that the primary key of the 'master' table contains the required entities to create the different views and related queries needed.

One additional note: If you find that your data is highly relational and requires several/many views to make it query-able, maybe Cassandra is not a good fit for the problem space and probably you should consider a RDBMS instead.

To extend on the example provided, probably user information is something we would like to keep in a relational DB, while high volume actions of those users could be registered in Cassandra. (purchases, clicks, heart rate samples, ...)

Cassandra table synchronization

3 Answers