
I have a Cassandra cluster running:

Cassandra 2.0.11.83 | DSE 4.6.0 | CQL spec 3.1.1 | Thrift protocol 19.39.0

The cluster has 18 nodes, split among 3 datacenters, 6 in each. My system_auth keyspace has the following replication defined:

replication = { 'class': 'NetworkTopologyStrategy', 'DC1': '4', 'DC2': '4', 'DC3': '4'}
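
For reference, a setting like that is normally applied with an ALTER KEYSPACE along these lines (using the DC names and RFs from above), followed by a repair of system_auth on each node:

    ALTER KEYSPACE system_auth WITH replication =
      { 'class': 'NetworkTopologyStrategy', 'DC1': '4', 'DC2': '4', 'DC3': '4' };

    -- then, on every node:
    -- nodetool repair system_auth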

and my authenticator/authorizer are set to:

authenticator: org.apache.cassandra.auth.PasswordAuthenticator

authorizer: org.apache.cassandra.auth.CassandraAuthorizer

This morning I brought down one of the nodes in DC1 for maintenance. Within seconds to a minute, client applications started logging exceptions like this:

"User my_application_user has no MODIFY permission on or any of its parents"

Running 'LIST ALL PERMISSIONS OF my_application_user' on one of the other nodes shows that the user has SELECT and MODIFY on keyspace xxxxx, so I am rather confused. Do I have a setup issue? Is this a bug of some sort?
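
For completeness, the exact statement I am running is the first line below; the second is just the keyspace-scoped variant of the same check, in case that's more useful:

    LIST ALL PERMISSIONS OF my_application_user;
    LIST ALL PERMISSIONS ON KEYSPACE xxxxx OF my_application_user;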

You'll want to make sure you also have increased the replication factor for dse_security, then run nodetool repair on both keyspaces (ref: Configuring system_auth and dse_security keyspace replication | DataStax Enterprise 4.7 Documentation). - BrianC
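(Roughly what that suggestion looks like in practice, assuming dse_security exists on this DSE version and gets the same RF as system_auth:)

    ALTER KEYSPACE dse_security WITH replication =
      { 'class': 'NetworkTopologyStrategy', 'DC1': '4', 'DC2': '4', 'DC3': '4' };

    -- then, on every node:
    -- nodetool repair system_auth
    -- nodetool repair dse_security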
Thank you BrianC. I checked and I don't seem to have that keyspace - the only ones I have (other than user-created ones) are: system, dse_system, system_auth and system_traces. This cluster was on DSE 4.0.1 at one point and was upgraded to 4.6.0, so maybe that's why? The dse_system KS is set with a replication of EverywhereStrategy, and the system KS with LocalStrategy. All others use NetworkTopologyStrategy. What am I missing? Thank you again for responding! - CRCerr0r
The two things to check are that all of those keyspaces have enough replication in each DC. The one you show for system_auth is fine, as are any that use EverywhereStrategy (where that happens automatically). The second thing is to run a nodetool repair on all nodes for those keyspaces after changing the RF. I don't know if this is your problem or not, but that's where I would start. Have you also tried the "list all permissions" check on each node to make sure they all agree? - BrianC
Hi BrianC, thanks for sticking with me. :) The cluster has an OpsCenter repair service running, and it has completed at least 3-4 repair passes since the last time there was a change to a user name/pass, and even more since the RF was last changed. I also tested logging in as that user on every node (via 'cqlsh LocalNodeIP -u my_application_user -p user_pass -f commands_file_containing_list_all_permissions_for_user') and it logs in fine on all nodes, and they all agree on the permissions. I will try running a manual repair on the system_auth KS and see how that goes. Any other KSs? - CRCerr0r
BTW, my end goal is to decommission half of the nodes, and the one that is giving me the issue (causing login failures when I brought it down for maintenance) is one of the ones slated for decom, so I need to figure out what the deal is, before nuking it and shooting myself in the foot. :( - CRCerr0r

1 Answer


Re-posting this as the answer, as BrianC suggested above.

So this is resolved... Here's the sequence of events that seems to have fixed it:

  1. Add 18 more nodes
  2. Run cleanup on original nodes (this was part of the original plan)
  3. Run a scrub on one table, since it was throwing exceptions during cleanup
  4. Run a repair on the system_auth KS on the original troubled node (the nodetool commands behind steps 2-4 are sketched below)
  5. Wait for repair service to complete a full pass on all keyspaces
  6. Decom original 18 nodes.
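
Roughly, the commands behind steps 2-4 (the keyspace/table names in the scrub are redacted placeholders here):

    nodetool cleanup                    # step 2, on each of the original nodes
    nodetool scrub <keyspace> <table>   # step 3, only on the table that failed cleanup
    nodetool repair system_auth         # step 4, on the original troubled node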

Honestly, I don't know exactly what fixed it. The system_auth repair makes the most sense, but what doesn't make sense is that the repair service had already completed many passes before, so I can't say why it worked this time. I hope this at least helps someone.