Cassandra performance: split CF or not?

Question

I'm designing a Cassandra database as a way to learn about it, and I have a question I would like an expert to help me clarify:

I have read that the rows of each column family are distributed across the nodes, so each node holds part of the rows of a given column family. Does that mean it is not a good idea to divide a column family into many column families, even when that column family has millions of rows?

My experience with RDBMSs says that it is better to split very big tables into smaller ones for better performance, but it seems that in Cassandra there is no need for this and, what's more, having many column families would require more memory. Am I right? Is it better for performance to keep many rows in one column family than to split it into many?

Thanks!

rs_atl · Accepted Answer · 2013-02-01T19:51:22

There is no need to shard column families in Cassandra. You can put as much data in one CF as you have storage space and machines to store it. One thing to consider, however, is that you will get better performance with many smaller machines than with a few machines with really big drives. And you do NOT want to put all that data on shared storage. Cassandra gets its speed through parallel sequential reads and writes.
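As a rough illustration of why a single large CF already spreads across machines, here is a toy sketch in Python. It assumes a hypothetical four-node cluster and places each row by hashing its key and taking a modulo. Real Cassandra assigns token ranges on a ring through its partitioner, so treat this only as a picture of the idea, not of the actual placement logic.

```python
import hashlib
from collections import Counter

# Hypothetical node names, for illustration only.
NODES = ["node-a", "node-b", "node-c", "node-d"]


def owner(row_key: str) -> str:
    """Pick a node for a row key by hashing it (toy placement, not Cassandra's ring)."""
    digest = hashlib.md5(row_key.encode("utf-8")).digest()
    token = int.from_bytes(digest, "big")
    return NODES[token % len(NODES)]


def distribution(num_rows: int) -> Counter:
    """Count how many rows of one column family land on each node."""
    return Counter(owner(f"user:{i}") for i in range(num_rows))


counts = distribution(100_000)
for node in NODES:
    print(node, counts[node])
```

Each node ends up with roughly a quarter of the rows, so adding machines spreads the same CF more thinly without any manual splitting.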

One thing you DO want to watch out for is unbounded row growth, i.e. adding columns to a single row without any limit. This is a pretty easy problem to solve by sharding keys if necessary. But even without sharding, you can write millions of columns to a row.
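One common way to shard keys is to add a time bucket to the row key, so that each row only collects columns for a bounded period. The sketch below is a minimal example in Python under assumed conventions: the `source_id:YYYYMMDD` key layout, the one-day bucket, and both function names are hypothetical choices, not anything Cassandra prescribes.

```python
from datetime import datetime, timedelta, timezone
from typing import List


def bucketed_row_key(source_id: str, ts: datetime) -> str:
    """Build a row key that includes the UTC day, so a row holds one day of columns.

    `ts` is expected to be timezone-aware.
    """
    day = ts.astimezone(timezone.utc).date()
    return f"{source_id}:{day:%Y%m%d}"


def row_keys_for_range(source_id: str, start: datetime, end: datetime) -> List[str]:
    """List every bucketed row key needed to read the range [start, end]."""
    day = start.astimezone(timezone.utc).date()
    last = end.astimezone(timezone.utc).date()
    keys = []
    while day <= last:
        keys.append(f"{source_id}:{day:%Y%m%d}")
        day += timedelta(days=1)
    return keys


write_ts = datetime(2013, 2, 1, 19, 51, 22, tzinfo=timezone.utc)
print(bucketed_row_key("sensor-42", write_ts))
print(row_keys_for_range("sensor-42", write_ts, write_ts + timedelta(days=2)))
```

Writes go to the row for the current bucket, and a range read fetches the handful of rows that cover the requested period. The bucket size is a trade-off: smaller buckets keep rows short but mean more rows per query.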
