I've been researching Cassandra and trying to get an understanding of its architecture, and I was reading the following page from the wiki: http://wiki.apache.org/cassandra/MemtableSSTable
So, to follow the workflow here: you send a request to update your table, that request is written to the CommitLog and then to an in-memory table called a Memtable (which can be rebuilt from the CommitLog in case of system failure). Once the Memtable hits a certain size, the entire Memtable is flushed to an on-disk SSTable, which can no longer be modified, only merged with other SSTables during compaction. When you reach a configurable number of SSTables, compaction kicks in and merges them, freeing up disk space and producing a single new, up-to-date SSTable. Please correct me if I've misunderstood anything here.
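To check that I have the flow right, here is a rough toy sketch of how I picture the write path and compaction. The names (`Node`, `flush_threshold`, `compact`) are my own illustration, not Cassandra's actual classes or settings:

```python
class SSTable:
    """Immutable, sorted on-disk table (simulated here as a frozen dict)."""
    def __init__(self, rows):
        self.rows = dict(sorted(rows.items()))  # never modified after creation


class Node:
    """Toy model of one node's write path: CommitLog -> Memtable -> SSTables."""
    def __init__(self, flush_threshold=4):
        self.commit_log = []                 # durable, append-only log
        self.memtable = {}                   # in-memory, mutable
        self.sstables = []                   # immutable on-disk tables
        self.flush_threshold = flush_threshold

    def write(self, key, value):
        self.commit_log.append((key, value))  # 1. append to the CommitLog
        self.memtable[key] = value            # 2. apply to the Memtable
        if len(self.memtable) >= self.flush_threshold:
            self.flush()                      # 3. flush once "full"

    def flush(self):
        self.sstables.append(SSTable(self.memtable))  # write a new immutable SSTable
        self.memtable = {}                            # start a fresh Memtable

    def compact(self):
        # Merge all SSTables into one; newer tables win on key conflicts.
        merged = {}
        for table in self.sstables:           # iterate oldest -> newest
            merged.update(table.rows)
        self.sstables = [SSTable(merged)]
```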
Now I have a few questions about compaction. First, how expensive is this operation? If I triggered a compaction whenever there were two SSTables on disk, would that be prohibitive, or would I be better off waiting until the middle of the night when usage is down? Is compaction any cheaper with many small SSTables than with a few very large ones? Does having a lot of un-compacted SSTables hurt read performance? And how does concurrency work here: what if I'm reading from these SSTables when someone does an insert that flushes a new Memtable to disk, which in turn triggers a compaction?
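The read-performance question comes from how I picture reads working: as far as I can tell, a read has to check the Memtable and then potentially every SSTable, newest first, until it finds the key, so each un-compacted SSTable looks like one more place to search. A sketch under that assumption, continuing the toy `Node` above (again, my own names, not Cassandra's):

```python
def read(node, key):
    """Read path as I picture it: Memtable first, then SSTables newest-first."""
    if key in node.memtable:
        return node.memtable[key]
    for table in reversed(node.sstables):   # each extra SSTable is another lookup
        if key in table.rows:
            return table.rows[key]
    return None                             # key not found anywhere
```

If that mental model is wrong (e.g. if something lets reads skip most SSTables), that would probably answer the read-performance question too.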
Any info and experience you could provide about this would be great!