We have a six-node Cassandra 1.2.10 cluster running on AWS with NetworkTopologyStrategy, a replication factor of 3, and the EC2Snitch. Each AWS availability zone has two nodes in it.
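For reference, the keyspace replication is set up roughly along these lines; a minimal sketch using the Python cassandra-driver (the keyspace name demo_ks and the data center name us-east are illustrative assumptions, not our real names, and on 1.2 the native transport has to be enabled for this driver to connect):

    from cassandra.cluster import Cluster

    # Connect to any live node; the driver discovers the rest of the ring.
    cluster = Cluster(['10.0.22.142'])
    session = cluster.connect()

    # With the EC2 snitch the data center name comes from the AWS region
    # (e.g. 'us-east') and the rack from the availability zone, so RF=3
    # here means one replica in each of the three AZs.
    session.execute("""
        CREATE KEYSPACE demo_ks
        WITH replication = {'class': 'NetworkTopologyStrategy', 'us-east': 3}
    """)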
When we read or write data at QUORUM consistency while decommissioning a node, we get "May not be enough replicas present to handle consistency level".
This doesn't make sense: we are only taking one node down, and with an RF of 3 a QUORUM read/write needs only two replicas to respond, so even with one node down there should still be enough nodes holding the data.
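(For RF = 3, quorum is floor(3/2) + 1 = 2 replica acknowledgements.) A minimal sketch of the kind of read/write we are doing, again with the Python cassandra-driver; the events table and its columns are illustrative, not our real schema:

    from cassandra import ConsistencyLevel
    from cassandra.cluster import Cluster
    from cassandra.query import SimpleStatement

    cluster = Cluster(['10.0.22.142'])
    session = cluster.connect('demo_ks')

    # QUORUM with RF=3: 2 of the 3 replicas must acknowledge, so losing
    # exactly one node should not trigger the "not enough replicas" error.
    write = SimpleStatement(
        "INSERT INTO events (id, payload) VALUES (%s, %s)",
        consistency_level=ConsistencyLevel.QUORUM)
    session.execute(write, (42, 'hello'))

    read = SimpleStatement(
        "SELECT payload FROM events WHERE id = %s",
        consistency_level=ConsistencyLevel.QUORUM)
    for row in session.execute(read, (42,)):
        print(row.payload)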
Looking at the Cassandra log on a node that we are not decommissioning, we see this while the other node is being decommissioned:
INFO [GossipTasks:1] 2013-10-21 15:18:10,695 Gossiper.java (line 803) InetAddress /10.0.22.142 is now DOWN
INFO [GossipTasks:1] 2013-10-21 15:18:10,696 Gossiper.java (line 803) InetAddress /10.0.32.159 is now DOWN
INFO [HANDSHAKE-/10.0.22.142] 2013-10-21 15:18:10,862 OutboundTcpConnection.java (line 399) Handshaking version with /10.0.22.142
INFO [GossipTasks:1] 2013-10-21 15:18:11,696 Gossiper.java (line 803) InetAddress /10.0.12.178 is now DOWN
INFO [GossipTasks:1] 2013-10-21 15:18:11,697 Gossiper.java (line 803) InetAddress /10.0.22.106 is now DOWN
INFO [GossipTasks:1] 2013-10-21 15:18:11,698 Gossiper.java (line 803) InetAddress /10.0.32.248 is now DOWN
Eventually we see a message like this:
INFO [GossipStage:3] 2013-10-21 15:18:19,429 Gossiper.java (line 789) InetAddress /10.0.32.248 is now UP
for each of the nodes, so eventually the remaining nodes in the cluster come back up.
While these nodes are marked down, I can see why we get the "May not be enough replicas..." message: everything is down.
My question is: why does gossip mark these nodes that we aren't decommissioning as down in the first place?