0
votes

Hypothetically speaking, I plan to build a distributed system with cassandra as the database. The system will run on multiple servers say server A,B,C,D,E etc. Each server will have Cassandra instance and all servers will form a cluster.

In my hypothetical distributed system, X number of the total servers should process user requests. eg, 3 of servers A,B,C,D,E should process request from user uA. Each application should update its Cassandra instance with the exact copy of data. Eg if user uA sends a message to user uB, each application should update its database with the exact copy of the message sent and to who and as expected, Cassandra should take over from that point to ensure all nodes are up-to date.

How do I configure Cassandra to make sure Cassandra first checks all copies inserted into the database are exactly the same before updating all other nodes

Psst: kindly keep explanations as simple as possible. Am new to Cassandra, crossing over from MySQL. Thank you in advance

1

1 Answers

3
votes

Every time a change happens in Cassandra, it is communicated to all relevant nodes (Nodes that have a replica of the data). But sometimes that doesn't happen either because a node is down or too busy, the network fails, etc.

What you are asking is how to get consistency out of Cassandra, or in other terms, how to make a change and guarantee that the next read has the most up to date information.

In Cassandra you choose the consistency in each query you make, therefore you can have consistent data if you want to. There are multiple consistency options but normally you would only use:

ONE - Only one node has to get or accept the change. This means fast reads/writes, but low consistency (If you write to A, someone can read from B while it was not updated).

QUORUM - 51% of your nodes must get or accept the change. This means not as fast reads and writes, but you get FULL consistency IF you use it in BOTH reads and writes. That's because if more than half of your nodes have your data after you inserted/updated/deleted, then, when reading from more than half your nodes, at least one node will have the most recent information, which would be the one to be delivered. (If you have 3 nodes ABC and you write to A and B, someone can read from C but also from A or B, meaning it will always get the most up to date information).

Cassandra knows what is the most up to date information because every change has a timestamp and the most recent wins.

You also have other options such as ALL, which is NOT RECOMMENDED because it requires all nodes to be up and available. If a node is unnavailable, your system is down.

Cassandra Documentation (Consistency)