3
votes

I was wondering if sharding is an alternate name for partial replication or not. What I have figured out that --

Partial Repl. – each data item has only copies at some but not all of the nodes (‘Sharding’?)

Pure Partial Repl. – has only copies of a subset of the data item but no node contains a full copy of the database

Hybrid Partial Repl. – a set of nodes are full replicas and another set of nodes are partial replicas

2

2 Answers

4
votes

Partial replication is an interesting way, in which you distribute the data with replication from a master to slaves, each contains a portion of the data. Eventually you get an array of smaller DBs, read only, each contains a portion of the data. Reads can very well be distributed and parallelized.

But what about the writes?

Those are still clogged, in 1 big fat lazy master database, tasks as buffer management, locking, thread locks/semaphores, and recovery tasks - are the real bottleneck of the OLTP, they make writes impossible to scale... See more in my blog post here: http://database-scalability.blogspot.com/2012/08/scale-up-partitioning-scale-out.html. BTW - your topic right here just gave me a great idea for another post. I'll link to this question and give you the credit! :)

Sharding is where data appears only once, within an array of DBs. Each database is the complete owner of the data, data is read from there, data is written to there. This way, reads and writes are distributed and parallelized. Real scale-out can be acheived.

Sharding is a mess to handle, to maintain, it's hard as hell. ScaleBase (I work there), enable automatic transparent scale-out, just throw it in the middle and you'll have 10 DBs at the back, and it'll look like 1 to your app. Automatic, transparent super-sharding - in a box.

3
votes

Sharding is a method of horizontal partitioning of a table. It doesn't related to replication. Traditionally an RDBMS server located in the center of system with star like topology. That's why it becomes:

  1. the single point of failure

  2. the performance bottleneck of the system

To resolve issue #1 you use replication: if original server dies you fail over to a replica.

To resolve issue #2 you can:

  1. use sharding

    1.1 do sharding by yourself

    1.2 use your RDBMS "out of the box" clustering mechanism

  2. migrate to a NoSQL solution

Sharding allows you to scale out database to many servers by splitting the data among them. However sharding is a trade-off. It limits you in data joining/intersecting/etc.

You still have issue #1 if you use sharding. So it's a good practice to replicate sharded nodes.