
I'm seeing a performance degradation in Cassandra v1.2.5 when reading from a single row that now contains few or zero columns, but previously had many different columns added and then deleted.

To test I do the following:

  • Create a fresh column family
  • Measure the read speed of a row 100 times - 4.6 ms per read on average, with zero columns returned
  • Add 500,000 columns to the row
  • Remove all 500,000 columns from the row
  • Measure the read speed 100 times again - 282.4 ms per read on average, with zero columns returned

So after adding and removing 500,000 columns, reads became roughly 60 times slower than before (282.4 ms vs. 4.6 ms).
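A minimal sketch of the test, assuming a Thrift client such as pycassa (the keyspace, column family, row key, and column names here are placeholders):

```python
import time
import pycassa
from pycassa.pool import ConnectionPool
from pycassa.columnfamily import ColumnFamily

pool = ConnectionPool('test_ks', ['localhost:9160'])  # placeholder keyspace/host
cf = ColumnFamily(pool, 'test_cf')                    # placeholder column family
ROW = 'test_row'

def avg_read_ms(n=100):
    """Read the row n times and return the average latency in milliseconds."""
    start = time.time()
    for _ in range(n):
        try:
            cf.get(ROW)
        except pycassa.NotFoundException:
            pass  # row has zero live columns
    return (time.time() - start) * 1000.0 / n

print('fresh row: %.1f ms/read' % avg_read_ms())

# Add 500,000 columns, then delete them all again (batched to keep mutations small).
with cf.batch(queue_size=1000) as b:
    for i in range(500000):
        b.insert(ROW, {'col%06d' % i: 'x'})

with cf.batch(queue_size=1000) as b:
    for i in range(500000):
        b.remove(ROW, columns=['col%06d' % i])

print('after add+delete: %.1f ms/read' % avg_read_ms())
```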

I tried compaction, flush, and repair - nothing helped. At best, reads improved slightly, to 208.7 ms.

The only thing that restores read performance is removing the row completely. Writes and reads on other rows are still fast.

Why does this read-speed degradation happen, and how can I fix it?


1 Answer


The degradation is because of tombstones. Cassandra can't just delete the columns, because if a replica didn't receive the delete, the columns would reappear when that node came back online. For this reason, Cassandra stores deletes as tombstones, which are just like values but with a marker saying the column is deleted.

The tombstones are purged (during compaction) only after gc_grace_seconds has elapsed. By that time, it is assumed all replicas will have seen the delete, so the tombstones can safely be removed. The default is 10 days, which is why compacting right after the deletes doesn't help. You can control gc_grace_seconds per column family - if in your use case you delete at consistency level ALL, or deleted columns coming back to life doesn't matter too much, you could even lower it to 0.
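For example, a minimal sketch of lowering it with pycassa's SystemManager (the keyspace and column family names are placeholders):

```python
from pycassa.system_manager import SystemManager

sys_mgr = SystemManager('localhost:9160')
# Only safe if you delete at CL.ALL, or if deleted columns reappearing is acceptable.
sys_mgr.alter_column_family('test_ks', 'test_cf', gc_grace_seconds=0)
sys_mgr.close()
```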

Alternatively, if you want to delete a whole row, you can do a row delete rather than deleting individual columns. This inserts a row tombstone which, after compaction, means reading the row should be about as quick as if you had never inserted the now deleted columns.
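A short sketch of the difference, again assuming pycassa and the same placeholder names:

```python
from pycassa.pool import ConnectionPool
from pycassa.columnfamily import ColumnFamily

pool = ConnectionPool('test_ks', ['localhost:9160'])
cf = ColumnFamily(pool, 'test_cf')

# Whole-row delete: a single row tombstone covers everything in the row.
cf.remove('test_row')

# Per-column delete: one tombstone per named column, which is what piles up on reads.
cf.remove('test_row', columns=['col000001', 'col000002'])
```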