1
votes

I was watching this video about the CAP theorem, where the author explains well the trade-offs of distributed systems. However I disagree with the CAP theorem in the following aspect. Given the picture below:

enter image description here

Whenever there is a partition, in other words, whenever a slave loses its connection to the master, this slave immediately becomes unavailable. So you will say: You are choosing consistency over availability. And I will say NO!. My distributed system is still highly available because there are many other backup/redundant slave nodes that the client can failover to. So I'm keeping my consistency and I'm keeping my availability in the system. A failing slave node is immediately (and automatically) taking offline and the client is redirected to another slave node for reads.

Then you might say: now what happens if the master node dies, or if you have a partition where two master nodes are active? And the answer is simple: Your system must NEVER allow two master nodes to be active. Your system must always have one and only one master node with as many backup master nodes as you want, however all the backup master nodes will be inactive (i.e. not accepting writes and just building redundant state).

The only trade-off of such a system, because nothing is perfect: It will need human intervention for the case of a dying / bad state master, so that the active master can be shutdown by a human and guaranteed to be dead while the operator turns on (manually) one of the backup masters to take over write requests.

I've been thinking for a long time on how to eliminate this human intervention, but I don't think it is possible due to the fact that a machine cannot reliably determine the state of another machine in a distributed system. A human needs to make this decision and manually pull the plug to kill it.

Wouldn't this simple trade-off (human operator for the rare cases when the master is dying) beat the CAP theorem?

1
Well... And what happens if human is unavailable?vsminkov
That's the tradeoff. It is not unthinkable to have an operator monitoring a system 24/7. For systems that need this degree of availability, an operator will be monitoring the system 24/7 anyways.Pika Sucar
That human intervention is a loss of availability...kuujo

1 Answers

1
votes

As you show, having multiple slave nodes is a way to improve availability. But, since they have state (qty...), a slave node failure while it has modified state that is not yet shared back to the rest of the system will introduce consistency problems. If the network connections between master and slave (or between slaves for data access/updates) has issues, you get partitions, which also can lead to consistency problems. While the master is "down", even if you have an automated new master handling (see zookeeper, paxos, ...) your system is not available - in all these areas, CAP still applies.

You can't beat CAP, but you can still build a usable system by trading off against each of the constraints. You might be able to fix out of sync data after a partition has been repaired, or you could disallow modifications while there is a failure (partial availability...); many systems add networking redundancy to maximize system resistance to partitions and you could even declare that an hour outage every 6 months (as you manually re-elect a new master) is an acceptable availability expectation :-)

You should account for several things in your design:

  • the probability of failure - how often will a master fail, how often will a slave fail, how often will a connection fail?
  • the time taken for detection of a failure (network timeouts, cost of measuring...)
  • the time taken to respond and replace/repair the things impacted by the failure, and
  • the things that need to be done to restore the so-called invarients after a failure (e.g., giving customers an account credit because you sold them an item that is out of stock...)
  • how the state of the system will be shared across all the entities that need to know about it (what's up, down, failing...)
  • the service level agreements you need to meet/provide

When all is said and done, if you can replace your master or offline a broken slave or route around a failed network connection within the SLA for total system availability, without corrupting data and without losing customers, you may well have a viable distributed system design.