I am in the process of doing a rolling restart on a 4-node cluster running Cassandra 2.1.9. I stopped and started Cassandra on node 1 via "service cassandra stop/start", and noted nothing unusual in either system.log or cassandra.log. Running "nodetool status" from node 1 shows all four nodes up:
user@node001=> nodetool status
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns Host ID Rack
UN 192.168.187.121 538.95 GB 256 ? c99cf581-f4ae-4aa9-ab37-1a114ab2429b rack1
UN 192.168.187.122 630.72 GB 256 ? bfa07f47-7e37-42b4-9c0b-024b3c02e93f rack1
UN 192.168.187.123 572.73 GB 256 ? 273df9f3-e496-4c65-a1f2-325ed288a992 rack1
UN 192.168.187.124 625.05 GB 256 ? b8639cf1-5413-4ece-b882-2161bbb8a9c3 rack1
But running the same command from any other node shows node 1 still down:
user@node002=> nodetool status
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns Host ID Rack
DN 192.168.187.121 538.94 GB 256 ? c99cf581-f4ae-4aa9-ab37-1a114ab2429b rack1
UN 192.168.187.122 630.72 GB 256 ? bfa07f47-7e37-42b4-9c0b-024b3c02e93f rack1
UN 192.168.187.123 572.73 GB 256 ? 273df9f3-e496-4c65-a1f2-325ed288a992 rack1
UN 192.168.187.124 625.04 GB 256 ? b8639cf1-5413-4ece-b882-2161bbb8a9c3 rack1
"nodetool compactionstats" shows no pending tasks, and "nodetool netstats" shows nothing unusual. It's been over 12 hours and these inconsistencies persist. Another example is when I do a "nodetool gossipinfo" on the restarted node, which shows its status as normal:
user@node001=> nodetool gossipinfo
/192.168.187.121
generation:1574364410
heartbeat:209150
NET_VERSION:8
RACK:rack1
STATUS:NORMAL,-104847506331695918
RELEASE_VERSION:2.1.9
SEVERITY:0.0
LOAD:5.78684155614E11
HOST_ID:c99cf581-f4ae-4aa9-ab37-1a114ab2429b
SCHEMA:fd2dcb4b-ca62-30df-b8f2-d3fd774f2801
DC:datacenter1
RPC_ADDRESS:192.168.185.121
Versus another node, which shows node001's status as "shutdown":
user@node002=> nodetool gossipinfo
/192.168.187.121
generation:1491825076
heartbeat:2147483647
STATUS:shutdown,true
RACK:rack1
NET_VERSION:8
LOAD:5.78679987693E11
RELEASE_VERSION:2.1.9
DC:datacenter1
SCHEMA:fd2dcb4b-ca62-30df-b8f2-d3fd774f2801
HOST_ID:c99cf581-f4ae-4aa9-ab37-1a114ab2429b
RPC_ADDRESS:192.168.185.121
SEVERITY:0.0
Is there something I can do to remedy this situation so that I can continue with the rolling restart?
The fact that x.x.x.121 is considered Down may cause errors, but this depends on the replication factor and consistency level used (for instance, an RF of 4 with consistency level ONE won't be affected, while a lower RF or a higher consistency level will certainly cause issues). Are there any errors in the cassandra/system.log file? In the past, I've resolved a similar situation by restarting the node that was considered DOWN (x.121). Finally, it would be better to run "nodetool drain" before restarting the service. – Carlos Monroy Nieblas
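Putting the commenter's suggestion together, a per-node restart sequence might look like the following sketch. The "cassandra" service name is taken from the question; the final status check is run from a different node, since (as seen above) the restarted node's own view can disagree with its peers:

```shell
# Run on the node being restarted:
nodetool drain              # flush memtables and stop accepting new writes
sudo service cassandra stop
sudo service cassandra start

# Then, from ANOTHER node in the cluster, confirm the restarted node
# has returned to UN (Up/Normal) before moving on to the next node:
nodetool status
```

Draining first gives the node a clean shutdown, which helps the surviving nodes' gossip state converge on the correct status after the restart.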