
I have set up a Spark Standalone cluster on Kubernetes, and I am trying to connect to a Kerberized Hadoop cluster which is NOT on Kubernetes. I have placed core-site.xml and hdfs-site.xml in my Spark cluster's container and have set HADOOP_CONF_DIR accordingly. I am able to successfully generate the Kerberos credential cache in the Spark container for the principal that accesses the Hadoop cluster. But when I run spark-submit, it fails with the access control exception below in the worker. Note: the master and workers are running in separate Kubernetes pods.
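For reference, the setup inside the container looks roughly like this (the keytab path and principal are placeholders):

export HADOOP_CONF_DIR=/etc/hadoop/conf   # contains core-site.xml and hdfs-site.xml
kinit -kt /etc/security/keytabs/myuser.keytab myuser@EXAMPLE.COM
klist   # the ticket for myuser@EXAMPLE.COM shows up in the credential cache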

spark-submit --master spark://master-svc:7077 --class myMainClass myApp.jar
Client cannot authenticate via: [TOKEN, KERBEROS] 

However, when I run spark-submit from the Spark container in local mode, it is able to talk to the Hadoop cluster successfully.

spark-submit --master local[*] --class myMainClass myApp.jar

Is there any configuration I need to set to make the Worker use the credential cache in Spark Standalone mode?


1 Answer


You have a huge problem: AFAIK Spark Standalone does not handle any kind of authentication.

  • in local mode, the Spark client, driver and executors all live in the same JVM, so the Hadoop client libs can directly access the Kerberos ticket present in the local cache (hence Spark doesn't have to manage anything)
  • in yarn-cluster mode, the Spark client uses the local Kerberos ticket to connect to Hadoop services and retrieve special auth tokens, which are then shipped to the YARN container running the driver; the driver then broadcasts the tokens to the executors (see the example after this list)
  • in yarn-client mode it's similar, with a shortcut: the Spark driver runs alongside the client and already has the token available
  • with Spark Standalone you are screwed.
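For comparison, a plain yarn-cluster submission (reusing the class and jar names from your example) is where that token dance happens automatically, driven by your local kinit ticket:

spark-submit --master yarn --deploy-mode cluster --class myMainClass myApp.jar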

Cf. https://stackoverflow.com/a/44290544/5162372 for more details about Kerberos auth to Hive or HBase from Spark in yarn-* modes.

Cf. also the --principal and --keytab params needed by long-running jobs (e.g. Streaming) that must renew their Kerberos creds on the fly, from within the driver (since the Spark client has probably terminated just after launch).
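For such a long-running job the submit command would look something like this (the principal and keytab path are placeholders):

spark-submit --master yarn --deploy-mode cluster \
  --principal myuser@EXAMPLE.COM \
  --keytab /etc/security/keytabs/myuser.keytab \
  --class myMainClass myApp.jar

With --keytab, Spark ships the keytab to the driver so it can re-login and refresh its delegation tokens itself.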


Maybe you could try spark.yarn.access.namenodes to see if that forces the Spark client to fetch "additional" Kerberos tokens, but I would not bet on it, since that property will probably be ignored in Spark Standalone mode.
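If you do want to experiment, that property is just a --conf entry (the NameNode URI below is a placeholder):

spark-submit --master spark://master-svc:7077 \
  --conf spark.yarn.access.namenodes=hdfs://namenode.example.com:8020 \
  --class myMainClass myApp.jar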

Cf. the comment by Steve Loughran on "Access a secured Hive when running Spark in an unsecured YARN cluster".