Data modeling in Cassandra for node-based use cases

Question

I have a CQL table with 2 columns:

{

long minuteTimeStamp -> only minute part of epoch time. seconds are ignored.

String data -> some data

}

I have a 5-node Cassandra cluster and I want to distribute each minute's data uniformly across all 5 nodes. So if there are ~10k records per minute, each node should receive ~2k of them.

I also want to consume each minute's data in parallel, meaning 5 different readers, one reading on each node.

One solution I came up with is to keep one more column in the table, like this:

{

long minuteTimeStamp

int shardIdx

String data

partition key : (minuteTimeStamp,shardIdx)

}

With this, I would round-robin over shardIdx while writing the data. But since Cassandra uses vnodes, it is possible that (min0,0) goes to node0 and (min0,1) also goes to node0, because that token might belong to node0 as well. This way I could create hotspots, and it would also hamper reads: the 5 parallel readers are meant to read one on each node, but more than one reader might land on the same node.

How can we design the partition key so that data is uniformly distributed without writing a custom partitioner?
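To illustrate the concern, here is a rough Python sketch of a token ring with vnodes. It is not how Cassandra is implemented: the hash is MD5 truncated to 64 bits as a stand-in for Murmur3, the vnode tokens are random, and the names (token_for, build_ring, shard_owners), the key format and the 256 vnodes per node are all assumptions made for the example. If each shard key landed on an independently and uniformly chosen node, the chance that 5 shards hit 5 distinct nodes would be 5!/5^5, roughly 3.8%, so collisions are the expected case.

```python
import bisect
import hashlib
import random

NUM_NODES = 5
VNODES_PER_NODE = 256   # assumed value; a real cluster uses its num_tokens setting
TOKEN_SPACE = 2 ** 64   # simplification: unsigned tokens instead of Murmur3's signed range


def token_for(key: str) -> int:
    """Stand-in hash (MD5 truncated to 64 bits), NOT Cassandra's Murmur3."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def build_ring(seed: int = 42):
    """Give every node VNODES_PER_NODE random tokens; return sorted tokens and their owners."""
    rng = random.Random(seed)
    pairs = sorted(
        (rng.randrange(TOKEN_SPACE), node)
        for node in range(NUM_NODES)
        for _ in range(VNODES_PER_NODE)
    )
    return [t for t, _ in pairs], [n for _, n in pairs]


def owner(ring, token: int) -> int:
    """A key belongs to the first vnode token >= its own token, wrapping around the ring."""
    tokens, nodes = ring
    idx = bisect.bisect_left(tokens, token) % len(tokens)
    return nodes[idx]


def shard_owners(ring, minute: int, num_shards: int = 5):
    """Node that owns each (minuteTimeStamp, shardIdx) partition for one minute."""
    return [owner(ring, token_for(f"{minute}:{shard}")) for shard in range(num_shards)]


ring = build_ring()
owners = shard_owners(ring, minute=26681234)  # hypothetical minute value
print("owners per shard:", owners, "distinct nodes:", len(set(owners)))
```

Running this for a few different minute values shows how often two or more shards of the same minute end up on the same simulated node.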

Erick Ramirez · Accepted Answer · 2020-09-25T02:55:21

There's no need to make the data distribution more complex by sharding.

The default Murmur3Partitioner will distribute your data evenly across nodes as you approach hundreds of thousands of partitions.
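As a rough illustration of that point, the following Python sketch counts how many partitions each node would own as the number of partitions grows. It is only a simulation under stated assumptions: MD5 truncated to 64 bits stands in for Murmur3, vnode tokens are random, 256 vnodes per node is an assumed setting, and the helper names and the "minute-N" key format are made up for the example.

```python
import bisect
import hashlib
import random
from collections import Counter

NUM_NODES = 5
VNODES_PER_NODE = 256   # assumed value; a real cluster uses its num_tokens setting
TOKEN_SPACE = 2 ** 64   # simplification: unsigned tokens instead of Murmur3's signed range


def token_for(key: str) -> int:
    """Stand-in hash (MD5 truncated to 64 bits), NOT Cassandra's Murmur3."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def build_ring(seed: int = 42):
    """Give every node VNODES_PER_NODE random tokens; return sorted tokens and their owners."""
    rng = random.Random(seed)
    pairs = sorted(
        (rng.randrange(TOKEN_SPACE), node)
        for node in range(NUM_NODES)
        for _ in range(VNODES_PER_NODE)
    )
    return [t for t, _ in pairs], [n for _, n in pairs]


def owner(ring, token: int) -> int:
    """A key belongs to the first vnode token >= its own token, wrapping around the ring."""
    tokens, nodes = ring
    idx = bisect.bisect_left(tokens, token) % len(tokens)
    return nodes[idx]


def partitions_per_node(ring, num_partitions: int):
    """Count how many of num_partitions hypothetical partition keys each node owns."""
    counts = Counter(
        owner(ring, token_for(f"minute-{i}")) for i in range(num_partitions)
    )
    return [counts.get(node, 0) for node in range(NUM_NODES)]


ring = build_ring()
for n in (5, 1_000, 100_000):
    counts = partitions_per_node(ring, n)
    shares = [round(c / n, 3) for c in counts]
    print(f"{n:>7} partitions -> per-node counts {counts}, shares {shares}")
```

With only a handful of partitions the per-node counts are dominated by chance; comparing the printed shares for the larger partition counts shows how close the simulated ring gets to an even split.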

If your use case is really going to hotspot on "data 1", then that is an inherent problem with your use case or access pattern. It is rare in practice unless you have a super-node issue, for example a social graph use case where Taylor Swift or Barack Obama has millions more followers than everyone else. Cheers!
