
We have a six-node Cassandra 1.2.10 cluster running on AWS with NetworkTopologyStrategy, a replication factor of 3, and the Ec2Snitch. Each AWS availability zone has two nodes in it.
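With the Ec2Snitch, NetworkTopologyStrategy treats each availability zone as a rack, so with RF = 3 across three zones every row should end up with one replica in each zone. A simplified model of that placement (not Cassandra's actual code; the node names and zones below are made up):

    # Simplified rack-aware placement: walk the ring clockwise from the
    # row's token, preferring nodes in zones that don't yet hold a replica.
    def place_replicas(ring, start, rf=3):
        """ring: list of (node, zone) pairs in token order."""
        replicas, seen_zones = [], set()
        for i in range(len(ring)):
            node, zone = ring[(start + i) % len(ring)]
            if zone in seen_zones:
                continue  # prefer a zone that has no replica yet
            replicas.append(node)
            seen_zones.add(zone)
            if len(replicas) == rf:
                break
        return replicas

    # Six nodes, two per zone, like the cluster described above.
    ring = [("n1", "1a"), ("n2", "1b"), ("n3", "1c"),
            ("n4", "1a"), ("n5", "1b"), ("n6", "1c")]
    print(place_replicas(ring, 0))  # ['n1', 'n2', 'n3'] -- one per zone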

When we read or write data with a consistency level of QUORUM while decommissioning a node, we get "May not be enough replicas present to handle consistency level".

This doesn't make sense: we are only taking one node down, and with an RF of 3 a QUORUM read or write should still have enough live replicas (2) even with one node out.
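For reference, QUORUM needs floor(RF / 2) + 1 replicas to respond, so the cluster should tolerate exactly one unavailable replica per row:

    # Quorum arithmetic for this cluster.
    rf = 3
    quorum = rf // 2 + 1           # 2 replicas must respond
    tolerated_down = rf - quorum   # 1 replica per row may be down
    print(quorum, tolerated_down)  # -> 2 1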

Looking at the Cassandra log on a server that we are not decommissioning, we see this during the decommission of the other node:

 INFO [GossipTasks:1] 2013-10-21 15:18:10,695 Gossiper.java (line 803) InetAddress /10.0.22.142 is now DOWN
 INFO [GossipTasks:1] 2013-10-21 15:18:10,696 Gossiper.java (line 803) InetAddress /10.0.32.159 is now DOWN
 INFO [HANDSHAKE-/10.0.22.142] 2013-10-21 15:18:10,862 OutboundTcpConnection.java (line 399) Handshaking version with /10.0.22.142
 INFO [GossipTasks:1] 2013-10-21 15:18:11,696 Gossiper.java (line 803) InetAddress /10.0.12.178 is now DOWN
 INFO [GossipTasks:1] 2013-10-21 15:18:11,697 Gossiper.java (line 803) InetAddress /10.0.22.106 is now DOWN
 INFO [GossipTasks:1] 2013-10-21 15:18:11,698 Gossiper.java (line 803) InetAddress /10.0.32.248 is now DOWN

Eventually we see a message like this for each of the nodes:

 INFO [GossipStage:3] 2013-10-21 15:18:19,429 Gossiper.java (line 789) InetAddress /10.0.32.248 is now UP

So eventually the remaining nodes in the cluster come back to life.

While these nodes are down I can see why we get the "May not be enough replicas..." message: everything is down.

My question is: why does gossip mark the nodes that we aren't decommissioning as DOWN in the first place?


1 Answer


Gossip marks a peer DOWN when several heartbeat messages in a row aren't received from it. Usually this is due to overscheduling on the host machine: CPU, network, or disk may be overtaxed.
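Under the hood this is the phi accrual failure detector: the gossiper tracks recent heartbeat inter-arrival times for each peer and convicts the peer when the time since its last heartbeat becomes improbably long (phi exceeding phi_convict_threshold, default 8). A minimal sketch of the idea, assuming exponentially distributed heartbeat intervals (the class and method names are illustrative, not Cassandra's API):

    import math
    from collections import deque

    class PhiAccrualDetector:
        def __init__(self, threshold=8.0, window=1000):
            self.threshold = threshold
            self.intervals = deque(maxlen=window)  # recent inter-arrival times
            self.last_heartbeat = None

        def heartbeat(self, now):
            if self.last_heartbeat is not None:
                self.intervals.append(now - self.last_heartbeat)
            self.last_heartbeat = now

        def phi(self, now):
            if not self.intervals:
                return 0.0
            mean = sum(self.intervals) / len(self.intervals)
            # For an exponential distribution, the probability of waiting
            # time t with no heartbeat is exp(-t / mean), so
            # phi = -log10(exp(-t / mean)) = t / (mean * ln 10).
            return (now - self.last_heartbeat) / (mean * math.log(10))

        def is_down(self, now):
            return self.phi(now) > self.threshold

    # Heartbeats arrive every second, then the peer stalls (GC pause,
    # saturated CPU/disk): phi after a 10s gap is ~4.3 (still up),
    # after a 20s gap ~8.7 (convicted as DOWN).
    d = PhiAccrualDetector()
    for t in range(10):
        d.heartbeat(float(t))
    print(d.is_down(19.0), d.is_down(29.0))  # -> False True

A node that is overtaxed stops gossiping on time, phi climbs past the threshold on its peers, and they mark it DOWN even though the process is still alive, which matches the DOWN-then-UP flapping in your log.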

Here at DataStax we are working on making the gossip state machine more robust in this and other scenarios. If you are a DataStax customer, please open a ticket with Technical Support for tracking and the quickest resolution.