I have a cql table which has 2 columns
{
long minuteTimeStamp -> only minute part of epoch time. seconds are ignored.
String data -> some data
}
I have a 5 node cassandra cluster and I want to distribute per minute data uniformly on all 5 nodes. So if per minute data is ~10k records, so each node should consume ~2k data.
I also want to consume each minute data parallelly, means 5 different readers read data 1 on each node.
I came to one solution like I also keep one more column in table like
{
long minuteTimeStamp
int shardIdx
String data
partition key : (minuteTimeStamp,shardIdx)
}
By doing this while writing the data, I will do circular round-robin on shardIdx. Since cassandra uses vnodes, so it might be possible that (min0,0) goes to node0, and (min0,1) also goes to node0 only as this token might also belong to node0. This way I can create some hotspots and it will also hamper read, as 5 parallel readers who wanted to read 1 on each node, but more than one reader might land to same node.
How can we design our partition-key so that data is uniformly distributed without writing a custom partitioner ?