
I am trying to understand what are some good practices for database sharding in the real world. If you have a system like Gmail, what is the ideal way to do sharding?

I've looked here and it mentions three ways to do sharding. Two of them seem suited to an email system:

Lookup-based sharding, where your frontend machine keeps a lookup table that routes each query to the backend holding its data.

Hash-based sharding, where you apply a hash function to the query key to identify the backend machine that holds the data.

The problem with the lookup table is that it could quickly grow, and it may itself need to be sharded. It doesn't seem like a good design if we have to shard the lookup table that we use to find the shard containing our data.

Hash-based sharding creates an issue when you have to add more servers to your system. Typical hash functions use the number of servers to calculate the shard for a given key, so increasing the number of servers changes the result of the hash function for the same input. Is the trick to come up with a hash function that does not depend on the number of servers? If so, what would be an example for an email system?
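To make the rehashing problem concrete, here is a minimal sketch (the email addresses and shard counts are made up for illustration) showing how many keys change shards when a naive `hash(key) % N` scheme grows from 4 to 5 servers:

```python
import hashlib

def shard_for(email: str, num_servers: int) -> int:
    # Use a stable digest rather than Python's built-in hash(),
    # which is salted per process and not reproducible.
    digest = hashlib.md5(email.encode()).hexdigest()
    return int(digest, 16) % num_servers

emails = [f"user{i}@example.com" for i in range(1000)]

# Count how many mailboxes would have to move when growing from 4 to 5 servers.
moved = sum(1 for e in emails if shard_for(e, 4) != shard_for(e, 5))
print(f"{moved} of {len(emails)} mailboxes change shards")
```

With modulo hashing, roughly 80% of keys land on a different shard after the change (a key stays put only when `h % 4 == h % 5`), which is why a scheme whose key-to-shard mapping is decoupled from the server count is attractive.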

So, what would be a good sharding technique to use when you have to develop a scalable system such as Gmail?


2 Answers


I prefer a compromise. Let's say I have a dozen shards today. I might use a hash table with 1024 entries. Each entry would say which of the dozen shards those users reside on.

Advantages:

  • A manageable table size (1024).
  • Load balancing by moving one hash value's users from one shard to another, then updating that entry. The same mechanism works when adding a new shard.
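A minimal sketch of this compromise (the 12-shard starting point and round-robin assignment are assumptions for illustration): keys hash into a fixed number of buckets, and a small table maps each bucket to its current shard.

```python
import hashlib

NUM_BUCKETS = 1024  # fixed forever, independent of how many shards exist

# Bucket -> shard assignment; start by spreading a dozen shards round-robin.
bucket_to_shard = [b % 12 for b in range(NUM_BUCKETS)]

def shard_for(email: str) -> int:
    # Hash into a stable bucket, then look up which shard owns that bucket.
    digest = hashlib.md5(email.encode()).hexdigest()
    bucket = int(digest, 16) % NUM_BUCKETS
    return bucket_to_shard[bucket]

# Rebalancing or adding shard 12: copy one bucket's users over,
# then flip a single table entry. No other users are affected.
bucket_to_shard[7] = 12
```

The hash function never changes as shards come and go; only the 1024-entry table does, so migrations move one bucket's worth of users at a time.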

Lookup-based sharding can cause a performance issue: every query needs two hops, one to the lookup table and one to the data shard. And the same problem recurs if the lookup table itself needs to be sharded. Hash-based sharding is better, but you need to migrate data when storage nodes are added.

I prefer hash-based sharding, but we need a migration tool to manage data scale-out automatically.

Apache ShardingSphere may solve the data-migration problem. Its ShardingSphere-Scaling module migrates data automatically, and the system does not need to stop during the migration.

By the way, this project can solve the transparent-sharding problem too.

ShardingSphere-Scaling documentation: https://shardingsphere.apache.org/document/current/en/user-manual/shardingsphere-scaling/

Source code: https://github.com/apache/shardingsphere/tree/master/shardingsphere-scaling