10
votes

When I read about sharding, looks like authors don't take into account other tables the sharded table has to be joined to (even though they describe a shard as a "subset of an original database"). However, this is a very common situation and I still don't have a clue how to handle that. Some of the authors mention "static" tables referenced by a sharded table that may be replicated to each shard (for example, Country). However, they say nothing about tables referencing the sharded one.

Imagine that we run a social network and realize that our User table (id, name) cannot fit to a single server anymore because of an enormous amount of writes or because of size (or both). So we decide to partition it horizontally into multiple shards (say, 4, so users with id 1-1000 go to one shard, 1001-2000 to another etc.) and choose a User.id as a shard key. Since the User table is routinely joined to other tables, we move records from tables referencing a given user or referenced by it to a corresponding shard (this is quite a challenge because relations are often transitive, for example, table A may reference B which references the sharded table C). In order to simplify things, we can decide to replicate all but the User table to all shards in their entirety. So far so good.

Then, imagine the Friends table (id, user_id, friend_id) containing information regarding who is a friend of who and referencing the User table. A user 1001 has 2 friends, 2002 and 3003, and they are located on different shards. So if we need to fetch information about the user 1001 friends, we will have to perform 2 cross-shard joins. Even if we managed to places all related users on the same shard initially, a user can add a new friend from a different shard. We cannot move this friend 4004 to the user 1001 because other users from the same shard #5 can also have him as a friend.

To be honest, I cannot figure out how situations like this are handled when sharding is performed and I haven't seen any resources explaining that.

1

1 Answers

5
votes

When doing a join across sharded tables what you generally want to optimize for is the amount of data being transferred across the shards.

There are 5 types of distributed joins, as explained here, ordered from most preferred to least:

  1. Local/Collocated Reference Table Join

This is the example you mentioned with the Countries table. Each node keeps a copy of this table locally because it's rarely updated. The pros and cons are obvious: not everything can be marked as a reference table, but there is no data movement involved.

  1. Local/Collocated Distributed Table Join

This means sending a query to all nodes where the data required for the join is located. Then, the overall execution result is aggregated. The con is that the tables need to be sharded on the columns involved in the join condition. This is the most scalable algorithm as it involves no data movement before doing the join.

This would work with your Friends table example. Presumably, your Users table will be keyed by the User id, which is also the shard key, so there will be an index, so queries should be fast.

  1. Remote Distributed Table Join

All nodes in the cluster send the data they have for the two sides of the join to a single node so it can run the join. This type of join only performs well when the number of rows involved in the join are small.

  1. Broadcast Join

If you're doing a join where one of the sides has a large dataset and the other one has a small dataset, this type sends the small dataset to the larger one, and the node with the large dataset does the join locally.

  1. Shuffle Join

This is the most expensive but flexible way of running a distributed join. A lot of data movement is needed as many of the rows involved in the join are copied to other nodes to execute the join.