I've been researching Cassandra and trying to get an understanding of its architecture, and I was reading the following page from the wiki: http://wiki.apache.org/cassandra/MemtableSSTable
So, to follow the workflow here: you send a request to update your table, that request is written to the CommitLog and then to an in-memory table called a Memtable (which can be rebuilt from the CommitLog in case of system failure). Once the Memtable hits a certain size, the entire Memtable is flushed to an on-disk SSTable, which can no longer be modified, only merged with other SSTables during compaction. When you reach a configurable number of SSTables, compaction kicks in and merges them, freeing up disk space and producing a single new, up-to-date SSTable. Please correct me if I've misunderstood anything here.
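To check that I have the flow right, here is a rough toy sketch of how I picture the write path and compaction. The names (`Node`, `flush_threshold`, `compact`) are my own illustration, not Cassandra's actual classes or settings:

```python
class SSTable:
    """Immutable, sorted on-disk table (simulated here as a frozen dict)."""
    def __init__(self, rows):
        self.rows = dict(sorted(rows.items()))  # never modified after creation


class Node:
    """Toy model of one node's write path: CommitLog -> Memtable -> SSTables."""
    def __init__(self, flush_threshold=4):
        self.commit_log = []                 # durable, append-only log
        self.memtable = {}                   # in-memory, mutable
        self.sstables = []                   # immutable on-disk tables
        self.flush_threshold = flush_threshold

    def write(self, key, value):
        self.commit_log.append((key, value))  # 1. append to the CommitLog
        self.memtable[key] = value            # 2. apply to the Memtable
        if len(self.memtable) >= self.flush_threshold:
            self.flush()                      # 3. flush once "full"

    def flush(self):
        self.sstables.append(SSTable(self.memtable))  # write a new immutable SSTable
        self.memtable = {}                            # start a fresh Memtable

    def compact(self):
        # Merge all SSTables into one; newer tables win on key conflicts.
        merged = {}
        for table in self.sstables:           # iterate oldest -> newest
            merged.update(table.rows)
        self.sstables = [SSTable(merged)]
```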
Now I have a few questions about compaction. First, how expensive is this operation? If I triggered a compaction whenever there were two SSTables on disk, would that be prohibitive, or would I be better off waiting until the middle of the night when usage is down? Is compaction any cheaper with many small SSTables than with a few very large ones? Does having a lot of un-compacted SSTables hurt read performance? And how does concurrency work here: what if I'm reading from these SSTables when someone does an insert that flushes a new Memtable to disk, which in turn triggers a compaction?
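The read-performance question comes from how I picture reads working: as far as I can tell, a read has to check the Memtable and then potentially every SSTable, newest first, until it finds the key, so each un-compacted SSTable looks like one more place to search. A sketch under that assumption, continuing the toy `Node` above (again, my own names, not Cassandra's):

```python
def read(node, key):
    """Read path as I picture it: Memtable first, then SSTables newest-first."""
    if key in node.memtable:
        return node.memtable[key]
    for table in reversed(node.sstables):   # each extra SSTable is another lookup
        if key in table.rows:
            return table.rows[key]
    return None                             # key not found anywhere
```

If that mental model is wrong (e.g. if something lets reads skip most SSTables), that would probably answer the read-performance question too.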
Any info and experience you could provide about this would be great!