What can I do if the replica crashes?
RDS Replicas now support Multi-AZ.
https://aws.amazon.com/about-aws/whats-new/2018/01/amazon-rds-read-replicas-now-support-multi-az-deployments/
Multi-AZ gives you two instances, each with its own EBS volume, in two AZs, only one of them accessible at any one time and the other one sitting idle as a hot standby. When a failure occurs, the backup instance takes over and the DNS hostname for the instance switches from one to the other.
The actual implementation of Multi-AZ is not publicly documented, but it is said that replication is synchronous. The only way that seems possible is if the replication is storage-level replication rather than logical (binlog) replication, and there are various observations you can make which bear this out. It appears that the active instance actually writes to both volumes and the MySQL daemon on the backup instance is not running. When the failover event occurs, the server daemon on the backup is started up and goes through standard MySQL crash recovery.
Enabling Multi-AZ should address the question of what happens in the event of a crash... depending on your definition of "crash."
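For illustration, converting an existing replica to Multi-AZ might be requested along these lines. This is only a sketch, assuming boto3 is installed and credentials are configured. The helper names `multi_az_request` and `enable_multi_az` are hypothetical; `modify_db_instance` with `DBInstanceIdentifier`, `MultiAZ` and `ApplyImmediately` is the boto3 RDS call it relies on.

```python
def multi_az_request(replica_id, apply_immediately=True):
    """Build keyword arguments for boto3's RDS modify_db_instance call."""
    if not replica_id:
        raise ValueError("replica_id must be a non-empty instance identifier")
    return {
        "DBInstanceIdentifier": replica_id,
        "MultiAZ": True,
        "ApplyImmediately": apply_immediately,
    }


def enable_multi_az(replica_id):
    """Send the request; needs boto3 and AWS credentials (assumed here)."""
    # Imported lazily so the request builder above has no dependencies.
    import boto3

    rds = boto3.client("rds")
    return rds.modify_db_instance(**multi_az_request(replica_id))
```

Calling `enable_multi_az("my-replica")` with a real instance identifier would submit the change; `"my-replica"` is just a placeholder.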
Replicas can have daily backups and snapshots and can be recovered with point-in-time recovery just like a standalone or master instance... Point-in-time recovery of a DB instance in RDS never modifies the instance being "recovered" -- it creates a new one from a snapshot then rolls it forward using the binlogs.
...but in this case, of course, the "recovered" instance would be a different instance and would no longer be an RDS replica.
In that case, you would need to recover the failed instance to a point in time, create a new replica, and then dump and load the data from the recovered instance onto its replacement, but only the tables that are not present on the master, that is, the tables unique to the writable replica.
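The "only the tables unique to the replica" step can be sketched roughly as follows. The function names and the example host are hypothetical, and the sketch assumes you have already collected the table names for one schema from each server (for example from information_schema.tables) and that mysqldump reads its credentials from an option file.

```python
def replica_only_tables(master_tables, replica_tables):
    """Return the tables present on the replica but absent from the master."""
    return sorted(set(replica_tables) - set(master_tables))


def mysqldump_args(host, schema, tables):
    """Build a mysqldump argument list covering just the given tables."""
    if not tables:
        raise ValueError("nothing to dump: no replica-only tables")
    # mysqldump syntax: mysqldump [options] db_name [tbl_name ...]
    return ["mysqldump", f"--host={host}", "--single-transaction", schema, *tables]


if __name__ == "__main__":
    only_here = replica_only_tables(["orders", "users"],
                                    ["orders", "users", "report_cache"])
    print(mysqldump_args("recovered.example.internal", "app", only_here))
```

The resulting list can be handed to `subprocess.run` and its output loaded into the new replica; tables that also exist on the master are deliberately left out, since they arrive through replication.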
As a point of clarification, MySQL native replication has no problem with tables that exist on a replica but not on the master. It does have a problem with tables that exist on both master and replica but hold different data: that is an unsupported configuration. Any plan to make a replica writable must therefore require that tables coming from the master not be changed on the replica. There are a few exceptions; notably, additional non-unique indexes can safely be added to tables on a replica for query optimization purposes. Otherwise, replication will break and no further replication events can execute on the replica.
If replication fails due to misuse of the replica (e.g. dropping or changing a table that the master subsequently modifies) it is still a replica as far as RDS is concerned, just a broken one, and can be restored to normal operation including RDS replication... but this is a delicate operation requiring a low-level understanding of MySQL native replication. The gist of such a fix is that the relevant data in the replica's data set must be modified such that it is identical to the data as it existed on the master immediately after the failing replication event executed. Once a replica's data is in this state, replication can be kick-started and will pick up where it left off, eventually catching back up to real-time again.
A note of caution with writable replicas: if replication fails due to such a condition, you need to repair the replica, destroy it, or promote it to become its own independent master. Promotion permanently decouples it from its original master and cannot be undone. A broken replica must be dealt with reasonably promptly because RDS has protections that prevent the master from purging its binlogs until no managed replica has further need of them. The binlogs can therefore back up on the master, consuming storage space there, or pile up, saved but unexecuted, on the broken replica, consuming space there. The latter condition is more likely, but the former is not impossible to encounter.
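One way to tell which of those two situations you are in is to look at the replication thread fields in the output of SHOW SLAVE STATUS on the replica. The sketch below assumes you have already fetched that row as a dict. The function name and the returned labels are hypothetical, and mapping thread state to "where the logs pile up" is my reading of the behavior described above, not something RDS reports directly.

```python
def classify_replica(status):
    """Classify a SHOW SLAVE STATUS row (as a dict) into a rough state."""
    io_running = status.get("Slave_IO_Running") == "Yes"
    sql_running = status.get("Slave_SQL_Running") == "Yes"

    if io_running and sql_running:
        return "healthy"
    if io_running:
        # Events are still being fetched but not executed, so they
        # accumulate as relay logs on the replica.
        return "relay_logs_growing_on_replica"
    # Events are not being fetched, so the master has to keep its binlogs.
    return "binlogs_retained_on_master"


if __name__ == "__main__":
    row = {"Slave_IO_Running": "Yes", "Slave_SQL_Running": "No",
           "Last_SQL_Errno": 1146}
    print(classify_replica(row))
```

Anything other than "healthy" is the signal to repair, destroy, or promote the replica before storage becomes a problem.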
As a last resort, and thoroughly unsanctioned, it is possible to configure an RDS instance that is not an RDS replica (e.g. after it has been promoted to master) to connect to another RDS instance and replicate from it, using the same steps that are designed for migrating from on-premise servers onto RDS, with mysql.rds_set_external_master. This gives you RDS-to-RDS replication that RDS doesn't actually realize is happening.
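As a rough sketch of what that looks like, the procedure takes the source host, port, replication user and password, binlog file name, binlog position, and an SSL flag, and replication is then started with mysql.rds_start_replication. The helper below only builds the parameterized statements. Its name and all example values are hypothetical, and it assumes a driver that uses `%s` placeholders (PyMySQL and mysql-connector both do).

```python
def external_master_statements(host, port, user, password,
                               binlog_file, binlog_pos, use_ssl=False):
    """Return (sql, params) pairs to run, in order, on the target instance."""
    if binlog_pos < 0:
        raise ValueError("binlog_pos must be a non-negative position")
    configure = (
        "CALL mysql.rds_set_external_master(%s, %s, %s, %s, %s, %s, %s)",
        (host, port, user, password, binlog_file, binlog_pos,
         1 if use_ssl else 0),
    )
    start = ("CALL mysql.rds_start_replication", ())
    return [configure, start]


if __name__ == "__main__":
    # All values below are placeholders, not real endpoints or credentials.
    for sql, params in external_master_statements(
            "source.example.rds.amazonaws.com", 3306, "repl_user",
            "secret", "mysql-bin-changelog.000123", 4):
        print(sql, params)
```

Each pair can be passed to `cursor.execute(sql, params)` on a connection to the instance that should act as the replica. The binlog file and position must correspond to a consistent point in that instance's copy of the data.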