3
votes

I'm trying to evaluate AWS RDS Aurora as future replacement for our local MySQL databases, but I'm noticing some strange behaviors.

I have a basic cluster with a DB master (writer) and a replica (reader). My idea was to use the reader as an always available datasource, even when the writer in unavailable. But when I'm rebooting the master, it takes down the reader as well, making the setup quite worthless.

Looking at the reader replica log, this is what happens when the it notices that the writer is down:

enter image description here

Does anyone know how to have a Aurora read entry point that never goes down even if the writer is offline or busy for a brief time?

Or does the write/read "out of sync" always take down the reader entry points no matter the size of the cluster?

1

1 Answers

4
votes

The only way to have a replica that remains available during a reboot of the master would be to have an asynchronous replica using conventional MySQL replication -- which Aurora does support.

Aurora replication is very different than MySQL (or Galera) replication. A loss of the master necessarily triggers a reorganization of the cluster, because the individual instances don't have their own copies of the data, they share a 6-way replicated storage volume -- that's how replication can remain in the 10-20 ms time range. What's actually replicated from the master is the transaction log LSN. Replacement of a master requires one replica to be promoted, verify that the on-disk data structures are clean after taking over, and then all of the other replicas start follow it.

If the DB cluster has one or more Aurora Replicas, then an Aurora Replica is promoted to the primary instance during a failure event. A failure event results in a brief interruption, during which read and write operations fail with an exception.

https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/Aurora.Managing.html#Aurora.Managing.FaultTolerance

When an Aurora replica stops seeing updates from the master, it doesn't matter where the actual fault lies -- whether with the actual master or elsewhere in the infrastructure -- the replica stops serving queries because, best case, it no longer has access to authoritative data.

Where possible, zero-downtime patching appears to avoid a master restart during upgrades. Other than upgrades, there should not be a need to restart the master.