Cassandra repair after datacenter went down

Question

I have a Cassandra database (version 3.11.2) running in AWS with 2 datacenters, each in a different AWS region, with 3 nodes in each.

The replication factor on all keyspaces is 3, so every node holds a full copy of the data. The size of the data is about 10GB per node. All of our writes use LOCAL_QUORUM against one DC (let's call it DC1). The other DC is basically just a backup for disaster recovery: if the AWS region for DC1 becomes unavailable, we will redirect traffic to DC2.

My issue is that we had a network disconnection between the two DCs that lasted several hours, and several days later we noticed that data is missing in DC2. This all makes sense, since the DCs were apart for longer than the Hinted Handoff window (3 hours). So we need to run a repair to bring DC2 back in sync with DC1.
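As a rough illustration of the reasoning above, this sketch compares an outage duration with the hint window. It assumes the window is the stock `max_hint_window_in_ms` value from cassandra.yaml (10800000 ms, i.e. 3 hours); the function name and the sample durations are made up for illustration.

```python
# Assumed stock value of max_hint_window_in_ms in cassandra.yaml: 3 hours.
DEFAULT_MAX_HINT_WINDOW_MS = 3 * 60 * 60 * 1000  # 10,800,000 ms


def outage_covered_by_hints(outage_hours, max_hint_window_ms=DEFAULT_MAX_HINT_WINDOW_MS):
    """Return True if the whole outage fits inside the hint window.

    Once a node has been unreachable for longer than the window,
    coordinators stop storing hints for it, so writes made after that
    point can only reach it through a repair.
    """
    outage_ms = outage_hours * 60 * 60 * 1000
    return outage_ms <= max_hint_window_ms


# Hypothetical durations, not measurements from the incident.
print(outage_covered_by_hints(2))  # short blip: hints alone may be enough
print(outage_covered_by_hints(5))  # longer than the window: repair needed
```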

I went over the Cassandra docs and read countless SO answers, and for the life of me I couldn't figure out which repair is the right one to run... Do I need to issue a 'nodetool repair --full --sequential' from only one node? Do I need to run it on every node in the cluster? Maybe it's better to run 'nodetool rebuild'?

As long as you don't isolate the repair to the local_dc you should be fine. It will compare replicas from all DCs to come up with the "correct" answers. So you could simply run "nodetool repair" and it should give you what you want — Jim Wartnick

Carlos Monroy Nieblas · Accepted Answer · 2019-04-10T18:45:40

Executing nodetool repair on the nodes in datacenter2 should bring the data back in sync, but depending on the amount of data affected, this may take considerable time and resources. If datacenter2 serves only as a backup for disaster recovery purposes, it may be easier and quicker to back up the current DC1 cluster and restore it in the second datacenter (more information is available here).
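For the repair route, one common pattern is a full repair with the `-pr` (primary range) option, run on every node in both datacenters one node at a time, without `-local` or `-dc` so that replicas in both DCs are compared. The sketch below only builds the command lines; it is a minimal illustration, not a tested runbook. The host addresses and keyspace name are hypothetical, and `nodetool -h <host>` assumes remote JMX access is enabled (otherwise run the same `nodetool repair` command locally on each node).

```python
import shlex


def build_repair_commands(hosts, keyspace, full=True, primary_range=True):
    """Return one `nodetool repair` command line per host.

    With primary_range=True each node repairs only the token ranges it
    owns as primary, so the command has to be run on every node in
    every datacenter to cover the whole ring.
    """
    commands = []
    for host in hosts:
        cmd = ["nodetool", "-h", host, "repair"]
        if full:
            cmd.append("--full")  # full repair instead of incremental
        if primary_range:
            cmd.append("-pr")     # primary token ranges of this node only
        cmd.append(keyspace)
        commands.append(" ".join(shlex.quote(part) for part in cmd))
    return commands


# Hypothetical addresses: three nodes in DC1, three in DC2.
hosts = [
    "10.0.1.11", "10.0.1.12", "10.0.1.13",
    "10.1.1.11", "10.1.1.12", "10.1.1.13",
]

# Hypothetical keyspace name; run the printed commands one at a time.
for line in build_repair_commands(hosts, "my_keyspace"):
    print(line)
```

Since every node here holds a replica of every range, a single `nodetool repair --full` without `-pr` on one node would also touch all ranges, at the cost of that node coordinating the whole job.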
