By default, SCC resolves all provided contact points into IP addresses on the first connection and then uses only those IP addresses for reconnection; after the initial connection is established, it discovers the rest of the cluster. Usually this is not a problem, since SCC receives notifications about nodes going up and down and tracks their IP addresses. In practice, however, nodes can be restarted so quickly that these notifications are missed, and Spark jobs that use SCC can get stuck trying to connect to IP addresses that are no longer valid - I hit this multiple times on DC/OS.
This problem is solved in SCC 2.5.0, which includes a fix for SPARKC-571. It introduces a new configuration parameter, spark.cassandra.connection.resolveContactPoints: when set to false (true by default), SCC always uses the hostnames of the contact points for both the initial connection and reconnection, avoiding the problems with changed IP addresses.
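For example, here is a minimal sketch of setting this parameter on a SparkSession; the application name and hostnames are placeholders, not anything prescribed by SCC:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("scc-hostname-reconnect") // hypothetical app name
  // Pass contact points as hostnames; with resolveContactPoints=false
  // SCC keeps the hostnames instead of pinning the IPs resolved at startup.
  .config("spark.cassandra.connection.host", "cassandra-0.example.com,cassandra-1.example.com")
  .config("spark.cassandra.connection.resolveContactPoints", "false")
  .getOrCreate()
```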
So on K8s I would try using this configuration parameter with a normal Cassandra deployment, passing stable service hostnames as the contact points.
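As a sketch of what that could look like, the job below addresses Cassandra through a Kubernetes headless-service DNS name, which stays stable even when pod IPs change on restart; the service name, namespace, keyspace, and table here are assumptions for illustration:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("k8s-cassandra-job") // hypothetical app name
  // Hypothetical headless service "cassandra" in namespace "db".
  .config("spark.cassandra.connection.host", "cassandra.db.svc.cluster.local")
  .config("spark.cassandra.connection.resolveContactPoints", "false")
  .getOrCreate()

// Read a table through the connector's data source (keyspace/table are placeholders).
val df = spark.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "ks", "table" -> "tbl"))
  .load()
df.show()
```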