
I have a question about Cluster Sharding and 'state-store-mode=ddata'.

We have an Actor System with 10,000,000 actors and we are using Cluster Sharding. Out of the box it is configured for 'distributed data' (ddata), while the Cluster Sharding page states in big letters, under Persistence Mode Deprecated:

    Warning
Persistence for state store mode is deprecated. It is recommended to migrate to ddata for the coordinator state and if using replicated entities migrate to eventsourced for the replicated entities state.

The data written by the deprecated persistence state store mode for remembered entities can be read by the new remember entities eventsourced mode.

Once you’ve migrated you can not go back to persistence mode.

However, I also found some articles on the internet saying that Akka Distributed Data is not such a good idea for big systems, and I think 10,000,000 actors qualifies as a big system:

Akka Distributed Data - Scaling

Akka Distributed Data - Large Data Set

So my questions are:

  • Do you know which configuration parameters affect the scaling of Cluster Sharding with Distributed Data?
  • For Cluster Sharding, my experiments show that with more shards, Sharding with Distributed Data scales better. Is this a correct assumption?
  • If Distributed Data is not appropriate for this number of actors, should I stick to the 'persistence' mode, which the Akka documentation marks as deprecated?
  • If neither ddata nor persistence is the way to go for this number of actors, what should we use instead?

Thanks for any answers.


1 Answer


Do you know which configuration parameters affect the scaling of Cluster Sharding with Distributed Data?

Distributed Data does not have any direct effect on the scaling of shards. It can handle up to 100,000 top-level entries, which translates into support for tens of thousands of shards.

The communication from the client to the shard allocation strategy is via Distributed Data. It uses a single LWWMap that can support 10s of thousands of shards. Later versions could use multiple keys to support a greater number of shards.

You can control the number of shards with the following parameter:

akka.cluster.sharding {
  # Number of shards used by the default HashCodeMessageExtractor
  # when no other message extractor is defined. This value must be
  # the same for all nodes in the cluster and that is verified by
  # configuration check when joining. Changing the value requires
  # stopping all nodes in the cluster.
  number-of-shards = 1000
}

It will create the specified number of shards according to the following rule:

By default the shard identifier is the absolute value of the hashCode of the entity identifier modulo the total number of shards.
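
For illustration, here is a minimal Scala sketch of that rule, mirroring what the default HashCodeMessageExtractor does (the entity ids and shard count below are made-up values):

object ShardIdSketch {
  // Shard id = absolute value of (entity id's hashCode modulo the
  // configured number of shards), per the rule quoted above.
  def shardId(entityId: String, numberOfShards: Int): String =
    math.abs(entityId.hashCode % numberOfShards).toString

  def main(args: Array[String]): Unit = {
    // With number-of-shards = 1000, every entity id maps
    // deterministically onto one of the 1000 shards.
    println(shardId("user-42", 1000))
    println(shardId("user-43", 1000))
  }
}

Note that the mapping is stable only as long as number-of-shards stays the same, which is why the configuration comment above says changing the value requires stopping all nodes in the cluster.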


For Cluster Sharding, my experiments show that with more shards, Sharding with Distributed Data scales better. Is this a correct assumption?

Yes and no. Too few shards cause uneven entity allocation; too many shards cause shard management overhead. The golden number of shards is the planned maximum number of nodes times ten; for example, with at most 100 nodes, set number-of-shards = 1000. See the docs:

As a rule of thumb, the number of shards should be a factor ten greater than the planned maximum number of cluster nodes. It doesn’t have to be exact. Fewer shards than number of nodes will result in that some nodes will not host any shards. Too many shards will result in less efficient management of the shards, e.g. rebalancing overhead, and increased latency because the coordinator is involved in the routing of the first message for each shard.


If Distributed Data is not appropriate for this number of actors, should I stick to the 'persistence' mode, which the Akka documentation marks as deprecated?

10 million actors do not influence the size of the sharding state, since the coordinator state scales with the number of shards rather than the number of entities, so it is appropriate to use ddata mode.
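
As a minimal sketch (using the Akka Typed API; the Counter entity and its messages are made-up names for illustration), sharding that many entities needs no special state-store configuration, because ddata is the default:

import akka.actor.typed.Behavior
import akka.actor.typed.scaladsl.Behaviors
import akka.cluster.sharding.typed.scaladsl.{ClusterSharding, Entity, EntityTypeKey}

object Counter {
  sealed trait Command
  case object Increment extends Command

  // A single type key covers all entities of this kind, however many there are.
  val TypeKey: EntityTypeKey[Command] = EntityTypeKey[Command]("Counter")

  def apply(entityId: String): Behavior[Command] = counter(0)

  private def counter(value: Int): Behavior[Command] =
    Behaviors.receiveMessage { case Increment => counter(value + 1) }
}

// Wiring at application startup, where system is your ActorSystem[_]:
// val sharding = ClusterSharding(system)
// sharding.init(Entity(Counter.TypeKey)(ctx => Counter(ctx.entityId)))
// sharding.entityRefFor(Counter.TypeKey, "user-42") ! Counter.Increment

The coordinator only replicates which node hosts which shard, so the ddata state stays small regardless of the entity count; remember-entities is a separate concern with its own eventsourced mode, as the warning quoted above notes.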