
I am trying to understand what are some good practices for database sharding in the real world. If you have a system like Gmail, what is the ideal way to do sharding?

I've looked here and it mentions three ways to do sharding. Two of them seem suited to an email system:

Lookup-based sharding, where your frontend machine keeps a lookup table that routes each query to the backend holding its data.

Hash-based sharding, where you apply a hash function to the query key to identify the backend machine that holds the data.

The problem with the lookup table is that it could quickly grow, and it may itself need to be sharded. It doesn't seem like a good design if we have to shard the lookup table that we use to find the shard containing our data.

Hash-based sharding creates an issue when you have to add more servers to your system. Typical hash functions use the number of servers to calculate the shard for a given key, so increasing the number of servers changes the result of the hash function for the same input. Is the trick to come up with a hash function that does not depend on the number of servers? If so, what would be an example for an email system?
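To make the rehashing problem concrete, here is a minimal sketch (the email addresses and shard counts are made up for illustration) showing how many keys change shards when a naive `hash(key) % N` scheme grows from 4 to 5 servers:

```python
import hashlib

def shard_for(email: str, num_servers: int) -> int:
    # Use a stable digest rather than Python's built-in hash(),
    # which is salted per process and not reproducible.
    digest = hashlib.md5(email.encode()).hexdigest()
    return int(digest, 16) % num_servers

emails = [f"user{i}@example.com" for i in range(1000)]

# Count how many mailboxes would have to move when growing from 4 to 5 servers.
moved = sum(1 for e in emails if shard_for(e, 4) != shard_for(e, 5))
print(f"{moved} of {len(emails)} mailboxes change shards")
```

With modulo hashing, roughly 80% of keys land on a different shard after the change (a key stays put only when `h % 4 == h % 5`), which is why a scheme whose key-to-shard mapping is decoupled from the server count is attractive.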

So, what would be a good sharding technique to use when you have to develop a scalable system such as Gmail?


2 Answers


I prefer a compromise. Let's say I have a dozen shards today. I might use a hash table with 1024 entries. Each entry would say which of the dozen shards those users reside on.

Advantages:

  • A manageable table size (1024).
  • Load balancing by moving one hash value's users from one shard to another, then updating that entry. The same mechanism works when adding a new shard.
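A minimal sketch of this compromise (the 12-shard starting point and round-robin assignment are assumptions for illustration): keys hash into a fixed number of buckets, and a small table maps each bucket to its current shard.

```python
import hashlib

NUM_BUCKETS = 1024  # fixed forever, independent of how many shards exist

# Bucket -> shard assignment; start by spreading a dozen shards round-robin.
bucket_to_shard = [b % 12 for b in range(NUM_BUCKETS)]

def shard_for(email: str) -> int:
    # Hash into a stable bucket, then look up which shard owns that bucket.
    digest = hashlib.md5(email.encode()).hexdigest()
    bucket = int(digest, 16) % NUM_BUCKETS
    return bucket_to_shard[bucket]

# Rebalancing or adding shard 12: copy one bucket's users over,
# then flip a single table entry. No other users are affected.
bucket_to_shard[7] = 12
```

The hash function never changes as shards come and go; only the 1024-entry table does, so migrations move one bucket's worth of users at a time.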

Lookup-based sharding can cause a performance issue: every query needs two hops, one to the lookup table and one to the data shard. And the same problem recurs if the lookup table itself needs to be sharded. Hash-based sharding is better, but you need to migrate data when storage nodes are added.

I prefer hash-based sharding, but we need a migration tool to manage data scale-out automatically.

Apache ShardingSphere may solve the data-migration problem. Its ShardingSphere-Scaling module migrates data automatically, and the system does not need to stop during the migration.

By the way, this project can solve the transparent-sharding problem too.

ShardingSphere-Scaling documentation: https://shardingsphere.apache.org/document/current/en/user-manual/shardingsphere-scaling/

Source code: https://github.com/apache/shardingsphere/tree/master/shardingsphere-scaling