The concept of DB sharding at high level makes sense, split up DB nodes so not a single one is responsible for all of the persistent data. However I'm a little confused on what constitutes the "shard". Does it duplicate entire tables across shards, or usually just a single one?
For instance if we take twitter as an example, at the most basic level we need a users and a tweets table. If we shard based on user ID, with 10 shards, it would reason that the shard function is userID mod 10 === shard location
. However what does this mean for the tweets table? Is that separate (a single DB table) or then is every single tweet divided up between the 10 tables, based on the whichever user ID created the tweet?
If it is the latter, and say we shard on something other than user ID, tweet created timestamp for example, how would we know where to look up info relating to the user if all tables are sharded based on tweet creation time (which the user has no concept of)?