1
votes

The concept of DB sharding at high level makes sense, split up DB nodes so not a single one is responsible for all of the persistent data. However I'm a little confused on what constitutes the "shard". Does it duplicate entire tables across shards, or usually just a single one?

For instance if we take twitter as an example, at the most basic level we need a users and a tweets table. If we shard based on user ID, with 10 shards, it would reason that the shard function is userID mod 10 === shard location. However what does this mean for the tweets table? Is that separate (a single DB table) or then is every single tweet divided up between the 10 tables, based on the whichever user ID created the tweet?

If it is the latter, and say we shard on something other than user ID, tweet created timestamp for example, how would we know where to look up info relating to the user if all tables are sharded based on tweet creation time (which the user has no concept of)?

1

1 Answers

2
votes

Sharding is splitting the data across multiple servers. The choice of how to split is very critical, and may be obvious.

At first glance, splitting tweets by userid sounds correct. But what other things are there? Is there any "grouping" or do you care who "receives" each tweet?

A photo-sharing site is probably best split on Userid, with meta info for the user's photos also on the same server with the user. (Where the actual photos live is another discussion.) But what do you do with someone who manages to upload a million photos? Hopefully that won't blow out the disk on whichever shard he is on.

One messy case is Movies. Should you split on movies? Reviews? Users who write reviews? Genres?

Sure, "mod 10" is convenient for saying which shard a user is on. That is, until you need an 11th shard! I prefer a compromise between "hashing" and "dictionary". First do mod 4096, then lookup in a 'dictionary' that maps 4096 values to 10 shards. Then, write a robust tool to move one group of users (all with the same mod-4096 value) from one shard to another. In the long run, this tool will be immensely convenient for handling hardware upgrades, software upgrades, trump-sized tweeters, or moving everyone else out of his way, etc.

If you want to discuss sharding tweets further, please provide the main tables that are involved. Also, I have strong opinions on how to you could issue unique ids, if you need them, for the tweets. (There are fiasco ways to do it.)