
I have 3 Elassandra nodes running in Docker containers.

The containers were created like this:

Host 10.0.0.1 : docker run --name elassandra-node-1 --net=host -e CASSANDRA_SEEDS="10.0.0.1" -e CASSANDRA_CLUSTER_NAME="BD Storage" -e CASSANDRA_DC="DC1" -e CASSANDRA_RACK="r1" -d strapdata/elassandra:latest

Host 10.0.0.2 : docker run --name elassandra-node-2 --net=host -e CASSANDRA_SEEDS="10.0.0.1,10.0.0.2" -e CASSANDRA_CLUSTER_NAME="BD Storage" -e CASSANDRA_DC="DC1" -e CASSANDRA_RACK="r1" -d strapdata/elassandra:latest

Host 10.0.0.3 : docker run --name elassandra-node-3 --net=host -e CASSANDRA_SEEDS="10.0.0.1,10.0.0.2,10.0.0.3" -e CASSANDRA_CLUSTER_NAME="BD Storage" -e CASSANDRA_DC="DC1" -e CASSANDRA_RACK="r1" -d strapdata/elassandra:latest
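Since all three containers use --net=host, a first sanity check is to confirm which addresses each node actually binds and gossips with. A minimal sketch, run on each host against its local container; the /etc/cassandra/cassandra.yaml path is an assumption based on the stock Cassandra layout, adjust it if the image keeps its config elsewhere:

    # Assumed config path; adjust if the image stores cassandra.yaml elsewhere.
    docker exec elassandra-node-1 grep -E 'listen_address|broadcast_address|- seeds' /etc/cassandra/cassandra.yaml

    # What the node reports about itself (gossip state, host ID, addresses).
    docker exec elassandra-node-1 nodetool info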

The cluster worked fine for a couple of days after it was created; Elasticsearch and Cassandra were both working perfectly.

Currently, however, all Cassandra nodes have become unreachable to each other. nodetool status on every node looks like this:

    Datacenter: DC1

    Status=Up/Down
    |/ State=Normal/Leaving/Joining/Moving
    --  Address   Load       Tokens  Owns (effective)  Host ID                               Rack
    DN  10.0.0.3  11.95 GiB  8       100.0%            7652f66e-194e-4886-ac10-0fc21ac8afeb  r1
    DN  10.0.0.2  11.92 GiB  8       100.0%            b91fa129-1dd0-4cf8-be96-9c06b23daac6  r1
    UN  10.0.0.1  11.9 GiB   8       100.0%            5c1afcff-b0aa-4985-a3cc-7f932056c08f  r1

The UN entry is the current host, 10.0.0.1. The output is the same on all other nodes: only the local node shows as UN.

nodetool describecluster on 10.0.0.1 shows:

    Cluster Information:
        Name: BD Storage
        Snitch: org.apache.cassandra.locator.GossipingPropertyFileSnitch
        DynamicEndPointSnitch: enabled
        Partitioner: org.apache.cassandra.dht.Murmur3Partitioner
        Schema versions:
            24fa5e55-3935-3c0e-9808-99ce502fe98d: [10.0.0.1]

            UNREACHABLE: [10.0.0.2,10.0.0.3]
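A healthy cluster reports exactly one schema version with every node listed under it, so it is worth collecting the same output from each node and comparing. A sketch using the container names from the question; note that nodetool resetlocalschema only helps once the peers are reachable again:

    # Run on each host against its local container and compare the schema versions.
    docker exec elassandra-node-2 nodetool describecluster

    # If a node still reports a diverging schema after the others are back UP,
    # it can be told to drop and re-fetch its local schema from the peers.
    docker exec elassandra-node-2 nodetool resetlocalschema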

When I attach to the first node, it just keeps repeating these messages:

    2018-12-09 07:47:32,927 WARN [OptionalTasks:1] org.apache.cassandra.auth.CassandraRoleManager.setupDefaultRole(CassandraRoleManager.java:361) CassandraRoleManager skipped default role setup: some nodes were not ready
    2018-12-09 07:47:32,927 INFO [OptionalTasks:1] org.apache.cassandra.auth.CassandraRoleManager$4.run(CassandraRoleManager.java:400) Setup task failed with error, rescheduling
    2018-12-09 07:47:32,980 INFO [HANDSHAKE-/10.0.0.2] org.apache.cassandra.net.OutboundTcpConnection.lambda$handshakeVersion$1(OutboundTcpConnection.java:561) Handshaking version with /10.0.0.2
    2018-12-09 07:47:32,980 INFO [HANDSHAKE-/10.0.0.3] org.apache.cassandra.net.OutboundTcpConnection.lambda$handshakeVersion$1(OutboundTcpConnection.java:561) Handshaking version with /10.0.0.3
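The endlessly repeating handshake messages suggest the inter-node (storage) port may not be reachable between the hosts, so a plain TCP check from each host is a cheap way to rule the network out. A sketch assuming nc is available on the hosts and the default ports are in use:

    # From host 10.0.0.1: 7000 is the default storage_port (7001 with internode encryption).
    nc -zv 10.0.0.2 7000
    nc -zv 10.0.0.3 7000

    # The CQL and Elasticsearch ports can be checked the same way.
    nc -zv 10.0.0.2 9042
    nc -zv 10.0.0.2 9200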

After a while, when one of the nodes is restarted:

    2018-12-09 07:52:21,972 WARN [MigrationStage:1] org.apache.cassandra.service.MigrationTask.runMayThrow(MigrationTask.java:67) Can't send schema pull request: node /10.0.0.2 is down.

Tried so far:

- Restarting all containers at the same time
- Restarting all containers one after another
- Restarting Cassandra inside all containers with service cassandra restart
- nodetool disablegossip and then enablegossip (a gossip-state check is sketched below)
- nodetool repair, which fails with: Repair command #1 failed with error Endpoint not alive: /10.0.0.2
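Since disabling and re-enabling gossip did not help, comparing each node's view of the gossip state may show whether heartbeats from the peers are arriving at all. A minimal sketch, assuming the container names from the question:

    # Shows generation, heartbeat and STATUS for every peer as seen by this node.
    # A peer whose heartbeat never increases means its gossip messages are not getting through.
    docker exec elassandra-node-1 nodetool gossipinfo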

It seems that the node schemas have diverged, but I still don't understand why the nodes are marked as down to each other.

Comments:

Perhaps the containers changed IP? Have you tried running nodetool status on each container? – Simon Fontana Oscarsson

The container IPs are fine. Each node shows one UN (the node where I run nodetool status) and two DN with the correct addresses. – Ventsi Popov

1 Answer


If the nodes are running different Cassandra versions, nodetool repair will not pull the data, so keep the same Cassandra version on every node. Sometimes a node shows as down or unreachable because gossip is not working properly. The reason may be the network, high load on that node, or the node being very busy with a lot of ongoing I/O such as repair, compaction, etc.
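A quick way to check the two causes mentioned above (mismatched versions and an overloaded node), sketched with the container names from the question:

    # Run on each host: all nodes should report the same release version.
    docker exec elassandra-node-1 nodetool version

    # Look for long-running compactions/repairs or dropped messages that point to an overloaded node.
    docker exec elassandra-node-1 nodetool compactionstats
    docker exec elassandra-node-1 nodetool tpstats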