We have a five-node Riak cluster (n_val is 3) running on Amazon EC2, spread across multiple availability zones. Since we don't have the Enterprise edition, we do not have the luxury of multi-datacenter replication and a full sync to a different zone/region.
Our current backup strategy is this:
- SSH to each node in the cluster, one node at a time
- Stop the Riak service using `riak stop` (because we are using the `leveldb` backend)
- Issue an EBS snapshot for the data volume that holds the Riak data
- Start the Riak service using `riak start`
- Move on to the next node and repeat the above steps
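To make the sequence concrete, here is a minimal sketch of the steps above as a Python dry run that only builds and prints the commands, without executing anything. The host names, Riak node names, and volume IDs are hypothetical placeholders. The two `riak-admin` calls after the restart (`wait-for-service riak_kv` and `transfers`) are not part of the procedure described above; they are an assumption about how one might confirm the node is serving again and see pending handoffs before moving on.

```python
import shlex


def backup_plan(nodes, check_after_restart=True):
    """Return the ordered commands for a rolling, one-node-at-a-time backup.

    `nodes` is a list of (ssh_host, riak_node_name, ebs_volume_id) tuples.
    All three values are placeholders supplied by the caller.
    """
    plan = []
    for host, node_name, volume_id in nodes:
        # Stop Riak so the LevelDB files are not being written during the snapshot.
        plan.append(["ssh", host, "riak stop"])
        # Start an EBS snapshot of the data volume for this node.
        plan.append([
            "aws", "ec2", "create-snapshot",
            "--volume-id", volume_id,
            "--description", f"riak backup {node_name}",
        ])
        # Bring the node back into service.
        plan.append(["ssh", host, "riak start"])
        if check_after_restart:
            # Assumption, not in the original procedure: wait for riak_kv
            # and list pending handoffs before touching the next node.
            plan.append(["ssh", host, f"riak-admin wait-for-service riak_kv {node_name}"])
            plan.append(["ssh", host, "riak-admin transfers"])
    return plan


if __name__ == "__main__":
    # Hypothetical two-node example; replace with real hosts and volume IDs.
    example_nodes = [
        ("riak1.example.com", "riak@riak1.example.com", "vol-0aaaaaaaaaaaaaaaa"),
        ("riak2.example.com", "riak@riak2.example.com", "vol-0bbbbbbbbbbbbbbbb"),
    ]
    for command in backup_plan(example_nodes):
        print(shlex.join(command))
```

To run the steps for real, each command list could be passed to `subprocess.run(command, check=True)` so that a failure on one node stops the loop before the next node is taken down.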
I have tested this approach on a 3-node test cluster that doesn't have much live activity, and recovered from the snapshots without an issue. I would like to understand from the experts here whether this approach is valid for a production cluster with heavy activity. Will we run into any issues related to handoffs while shutting down a node and starting it again? Is there anything else I am unaware of at the moment that might hamper our chances of recovery when a disaster occurs?
Thanks in advance!