
I have a two-machine cluster running Cassandra 1.2.6, with a keyspace that has a replication factor of 2. My application requires writing to both replicas in parallel while also letting Cassandra do its own replication, and I am hoping that Cassandra does not end up duplicating the key/value on the replica nodes.

For example:

  • I have two nodes, Node1 and Node2, a keyspace configured with replication factor 2, and a column family to push key/value pairs into.
  • I use a Python client (pycassa) to write to the cluster.
  • A key, "KeyX", hashes to Node1 and Node2. (I find out which servers a key hashes to with the nodetool command `$nodetool getendpoints KeyspaceName ColumnFamilyName KeyHexString`.)
  • I use the client to write (KeyX, Value) concurrently to Node1 and Node2, giving only that specific server name in each connection pool (see the sketch after this list).
  • When writing, I wait for only one write to succeed (consistency level ONE).
  • I then monitor the amount of disk space the cluster uses through `$nodetool status`.
  • I write around 100 keys, each with a 2MB value.
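
For reference, this is roughly what the write path looks like. It is a minimal sketch only: it assumes pycassa 1.x talking to Cassandra's Thrift port (9160), the keyspace and column family names from the `nodetool getendpoints` command above, and a column family that stores plain bytes; the host names and the 2MB payload are placeholders.

```python
import os
import pycassa

NODES = ['Node1:9160', 'Node2:9160']    # placeholder host names for the two replicas
VALUE = os.urandom(2 * 1024 * 1024)     # ~2MB payload per key

# One pool per node, so each write is sent to one specific server,
# mirroring the "write to both replicas in parallel" setup.
column_families = [
    pycassa.ColumnFamily(
        pycassa.ConnectionPool('KeyspaceName', server_list=[node]),
        'ColumnFamilyName')
    for node in NODES
]

for i in range(100):
    key = 'Key%d' % i
    for cf in column_families:
        # Wait for only one acknowledgement per write (consistency level ONE).
        cf.insert(key, {'value': VALUE},
                  write_consistency_level=pycassa.ConsistencyLevel.ONE)
```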

Ideally this should store around 400MB on disk (100 keys × 2MB × replication factor 2), with some overhead for storing the keys, which should be marginal compared to the value sizes I am using.

Observations:

  • If I do not write to all the nodes that the key hashes to, Cassandra handles the replication internally and the data size is around 400MB (200MB on each node for 100 keys with 2MB values).
  • If I do write to all the nodes the key hashes to, Cassandra writes more than the expected amount of data to disk, as much as 15% more. In our tests Cassandra wrote ~460MB instead of 400MB.

My question is: is this behavior (15% overhead) expected? Is there any configuration we need to tweak so that Cassandra properly handles concurrent writes to all the replicas?

Thanks!


1 Answer


There are two possible causes of the 15% extra space that I can think of.

One is that a replica will sometimes store two copies of a column temporarily. If you write a column twice in Cassandra at slightly different times, the two copies may go into separate memtables and so end up in separate SSTables on disk. Later, when the SSTables are merged by the compaction process, the older value is discarded, freeing the space. In your test you could run nodetool compact to force a compaction and see if the space usage goes down.

Another possible cause depends on how you ran the test when you didn't write to both nodes. If you did this at consistency level ONE, it is possible that some of the writes were dropped by the other replica, so it doesn't yet have all the keys. You can make sure it does by running nodetool repair. So the space used in your first observation may not account for all the keys.
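
As a rough way to run both of those checks, assuming `nodetool` is on the PATH and is run on each node in turn (by default it talks to the local node over JMX), with the keyspace and column family names from the question:

```python
import subprocess

# Flush memtables and force a major compaction so duplicate column versions are
# merged, run a repair so every replica ends up with every key, then re-check
# the reported load with nodetool status.
for cmd in (['nodetool', 'flush', 'KeyspaceName', 'ColumnFamilyName'],
            ['nodetool', 'compact', 'KeyspaceName', 'ColumnFamilyName'],
            ['nodetool', 'repair', 'KeyspaceName'],
            ['nodetool', 'status']):
    subprocess.check_call(cmd)
```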

You should also be aware that writing to all replicas at consistency level ONE does not guarantee that each replica holds a copy. The node receiving the data does not have to store it to return success for the write, even if it is a replica. It may be overloaded (in your workload, most likely because there is not enough I/O to write the data out) and drop the write, while the write succeeds on a different replica. This would cause less space to be used in your second observation, but it probably isn't happening in your test since the amount of data is relatively small.

If you need to guarantee that you have two copies, you should write at consistency level ALL and write each key only once.
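
A sketch of that approach with pycassa, using the same placeholder keyspace, column family and host names as in the question: the connection pool knows about both servers, each key is written once, and with ALL the call only returns once every replica has acknowledged the write.

```python
import os
import pycassa

pool = pycassa.ConnectionPool('KeyspaceName',
                              server_list=['Node1:9160', 'Node2:9160'])
cf = pycassa.ColumnFamily(pool, 'ColumnFamilyName')

value = os.urandom(2 * 1024 * 1024)   # ~2MB placeholder payload
for i in range(100):
    # With replication factor 2, ALL means both replicas must acknowledge the write.
    cf.insert('Key%d' % i, {'value': value},
              write_consistency_level=pycassa.ConsistencyLevel.ALL)
```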