1 vote

I'm new to Cassandra, and I'm stuck at one point.

Suppose I have a 5-node cluster with RF=1 (for simplicity):

Token Ranges 
==============
N1 : 1-100
N2 : 101-200
N3 : 201-300
N4 : 301-400
N5 : 401-500

I have a table with 10 partition keys:

ID (Partition Key) | Name
--------------------------
1                  | Joe
2                  | Sarah
3                  | Eric
4                  | Lisa
5                  | Kate
6                  | Agnus
7                  | Lily
8                  | Angela
9                  | Rodger
10                 | Chris

10 partition keys ==> 10 hash (token) values

Partition Key | Generated Token
================================
1             | 289 (goes on N3)
2             |  56 (goes on N1)
3             |  78 (goes on N1)
4             | 499 (goes on N5)
5             | 376 (goes on N4)
6             | 276 (goes on N3)
7             |   2 (goes on N1)
8             |  34 (goes on N1)
9             | 190 (goes on N2)
10            |  68 (goes on N1)

If this is the case, then:

N1 has the partition keys: 2, 3, 7, 8, 10
N2 has the partition keys: 9
N3 has the partition keys: 1, 6
N4 has the partition keys: 5
N5 has the partition keys: 4
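
A quick sketch of this lookup (using the made-up tokens above; real Cassandra uses Murmur3 tokens over a vastly larger range) reproduces that placement:

    # Toy token ranges from the example above (RF=1, one range per node).
    ranges = {
        "N1": (1, 100),
        "N2": (101, 200),
        "N3": (201, 300),
        "N4": (301, 400),
        "N5": (401, 500),
    }

    # The hypothetical partition-key -> token mapping from the question.
    tokens = {1: 289, 2: 56, 3: 78, 4: 499, 5: 376,
              6: 276, 7: 2, 8: 34, 9: 190, 10: 68}

    def owner(token):
        # Return the node whose token range contains this token.
        for node, (lo, hi) in ranges.items():
            if lo <= token <= hi:
                return node

    placement = {}
    for key, token in tokens.items():
        placement.setdefault(owner(token), []).append(key)

    print(placement)
    # {'N3': [1, 6], 'N1': [2, 3, 7, 8, 10], 'N5': [4], 'N4': [5], 'N2': [9]}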

So we see that N1 is heavily loaded compared to the other nodes (as per my understanding).

Please help me understand how data is evenly distributed in Cassandra, with respect to partitioners and consistent hashing.


3 Answers

1 vote

Selecting the partition key is very important for getting an even distribution of data across all the nodes. The partition key should be something with very high cardinality.

For example, in a 10-node cluster, selecting the state of a specific country as the partition key may not be ideal, since there is a very high chance of creating hotspots, especially when the number of records itself is not even across states. Choosing something like zip code would be better, and something like customer name or order number would be better still. You can also explore a composite partition key if it helps your use case.
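
A rough sketch of why cardinality matters, hashing hypothetical key values onto a 10-node cluster (md5 stands in for Cassandra's Murmur3 partitioner here; any uniform hash shows the same effect):

    import hashlib
    from collections import Counter

    NODES = 10

    def node_for(key):
        # md5 is a stand-in for Cassandra's Murmur3 partitioner.
        return int(hashlib.md5(key.encode()).hexdigest(), 16) % NODES

    # Low cardinality: five states can reach at most five of the ten
    # nodes, and popular states pile their records onto single nodes.
    states = ["CA", "TX", "NY", "FL", "WA"]
    print(Counter(node_for(s) for s in states))

    # High cardinality: thousands of order numbers spread out evenly,
    # at roughly 1,000 keys per node.
    orders = ["order-%d" % n for n in range(10_000)]
    print(Counter(node_for(o) for o in orders))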

1 vote

There is some truth to what you're posting here, mainly because data distribution via hashing is tough with small numbers. But let's add one assumption: say we use vNodes, with num_tokens: 4* set in cassandra.yaml.

So with this new assumption, token range distribution likely looks more like this:

  Token Ranges
  ==============
  N1 :    1-25, 126-150, 251-275, 376-400
  N2 :   26-50, 151-175, 276-300, 401-425
  N3 :   51-75, 176-200, 301-325, 426-450
  N4 :  76-100, 201-225, 326-350, 451-475
  N5 : 101-125, 226-250, 351-375, 476-500

Given this distribution, your keys are now placed like this:

  N1 has the partition keys : 5, 7
  N2 has the partition keys : 1, 6, 8
  N3 has the partition keys : 2, 9, 10
  N4 has the partition keys : 3
  N5 has the partition keys : 4

Now factor in that there is a random component to the range allocation algorithm, and the actual distribution could be even better.

As with all data sets, the numbers get better as the amount of data increases. I'm sure that you'd see better distribution with 1000 partition keys vs. 10.
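
Here's a rough sketch of that effect, reusing the rotating 25-token slices above (random tokens stand in for Murmur3 hashes of partition keys):

    import random
    from collections import Counter

    random.seed(42)

    # In the vNode layout above, the 1-500 ring is cut into twenty
    # 25-token slices that rotate through N1..N5.
    def owner(token):
        slice_idx = (token - 1) // 25       # which slice, 0..19
        return "N%d" % (slice_idx % 5 + 1)  # slices rotate N1..N5

    for n_keys in (10, 1000):
        tokens = [random.randint(1, 500) for _ in range(n_keys)]
        counts = Counter(owner(t) for t in tokens)
        print(n_keys, "keys:", dict(sorted(counts.items())))

With 10 keys the per-node counts swing widely from run to run; with 1000 keys each node lands close to 200.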

Also, as the size of your data set increases, data distribution will benefit from new nodes being added with allocate_tokens_for_keyspace set. This allows the token allocation algorithm to make smart (less random) decisions about token range assignment, based on your keyspace's replication factor.

*Note: Using vNodes with num_tokens: 4 is considered by many Cassandra experts to be an optimal production setting. With the new algorithm, the default of 256 tokens is quite high.

0 votes

In Cassandra, data is distributed based on the partition key and the partitioner's hashing algorithm. There are several other parameters that control data distribution and replication, such as the replication factor, the replication strategy, and the snitch. The standard recommended documentation is here: https://docs.datastax.com/en/cassandra-oss/2.2/cassandra/architecture/archDataDistributeAbout.html
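
For instance, with a replication factor greater than 1 and SimpleStrategy, replicas go on the next nodes clockwise around the ring. A minimal sketch of that idea, reusing the single-range toy ring from the question (not real Murmur3 tokens):

    # node -> upper bound of its token range, in ring order.
    ring = [("N1", 100), ("N2", 200), ("N3", 300), ("N4", 400), ("N5", 500)]

    def replicas(token, rf):
        # Find the node whose range covers the token...
        start = next(i for i, (_, end) in enumerate(ring) if token <= end)
        # ...then take RF consecutive nodes clockwise (SimpleStrategy has
        # no rack/DC awareness, unlike NetworkTopologyStrategy).
        return [ring[(start + i) % len(ring)][0] for i in range(rf)]

    print(replicas(289, rf=3))   # ['N3', 'N4', 'N5']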