3 votes

I have a long-running Spark Streaming job which reads from Kafka. The job is started once and is expected to run forever.

The cluster is Kerberized.

What I have observed is that the job runs fine for a number of days (more than 7). At the start of the job we can see that it acquires an HDFS delegation token, which is valid for 7 days:

18/06/17 12:32:11 INFO hdfs.DFSClient: Created token for user: HDFS_DELEGATION_TOKEN owner=user@domain, renewer=yarn, realUser=, issueDate=1529213531903, maxDate=1529818331903, sequenceNumber=915336, masterKeyId=385 on ha-hdfs:cluster

The job keeps running for more than 7 days, but after that period (a few days after maxDate) the status suddenly changes to ACCEPTED, at an unpredictable time. After this it tries to acquire a new Kerberos ticket and fails with a Kerberos error:

18/06/26 01:17:40 INFO yarn.Client: Application report for application_xxxx_80353 (state: RUNNING)
18/06/26 01:17:41 INFO yarn.Client: Application report for application_xxxx_80353 (state: RUNNING)
18/06/26 01:17:42 INFO yarn.Client: Application report for application_xxxx_80353 (state: ACCEPTED)
18/06/26 01:17:42 INFO yarn.Client:
client token: Token { kind: YARN_CLIENT_TOKEN, service:}

Final exception:

18/06/26 01:17:45 WARN security.UserGroupInformation: PriviledgedActionException as:user@domain (auth:KERBEROS) cause:javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]

Note: I already tried passing a keytab file so that delegation could be renewed indefinitely, but I am not able to pass the keytab file to Spark because it conflicts with the Kafka jaas.conf.
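
For reference, the jaas.conf in play looks roughly like this (a sketch only; the keytab path, principal and service name are placeholders, not our real values):

    KafkaClient {
      com.sun.security.auth.module.Krb5LoginModule required
      useKeyTab=true
      storeKey=true
      keyTab="./user.keytab"
      principal="user@domain"
      serviceName="kafka";
    };

The keyTab path is relative because the file is shipped to each YARN container's working directory via --files.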

So there are 3 related questions:

  • Why could the job change status from RUNNING to ACCEPTED?
  • Is the issue happening because I am not able to pass the keytab? If yes, how do I pass a keytab when using Kafka and Spark Streaming over Kerberos? --keytab does not work, because we are already passing the keytab with --files: it is configured in jaas.conf and distributed with the --files parameter of spark-submit (see the sketch after this list). Is there any other way the job can acquire a new ticket?
  • When the job again tries to go to the RUNNING state, YARN rejects it because it does not have a valid Kerberos ticket. Would it help if we ensured that the driver node always has a valid Kerberos ticket? Then, when this happens, it would be like submitting a new Spark job; since that node has a valid Kerberos ticket, it would not give the Kerberos error.
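
The sketch referenced in the second question: the job is currently submitted roughly like this (file names and the jar are placeholders):

    spark-submit \
      --master yarn --deploy-mode cluster \
      --files "jaas.conf,user.keytab" \
      --driver-java-options "-Djava.security.auth.login.config=./jaas.conf" \
      --conf "spark.executor.extraJavaOptions=-Djava.security.auth.login.config=./jaas.conf" \
      my-streaming-job.jar

Adding --keytab user.keytab on top of this is what conflicts: the keytab would then be shipped twice under the same file name.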
What versions of Spark, Hadoop and Kafka are you running? Did you look at stackoverflow.com/questions/47977075/…? – tk421
We are using Spark 2.1 and are able to connect to Kafka; the problem is only with long-running jobs. Version info: Spark 2.1.0.cloudera1, Hadoop 2.6.0-cdh5.8.4, KAFKA-2.1.1 – reemas

2 Answers

1 vote
  • Why could the job change status from RUNNING to ACCEPTED?

A job will transition from RUNNING to ACCEPTED if the application attempt failed and you still have attempts left under your ApplicationMaster retry limit; YARN then re-accepts the application and schedules a new attempt.
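
The number of attempts is governed by the standard YARN/Spark settings; a sketch (values illustrative):

    <!-- yarn-site.xml: cluster-wide ceiling on AM attempts -->
    <property>
      <name>yarn.resourcemanager.am.max-attempts</name>
      <value>2</value>
    </property>

    # per-application override, capped by the cluster ceiling
    spark-submit --conf spark.yarn.maxAppAttempts=2 ...

While the new attempt is being scheduled, the application report shows ACCEPTED, which matches the log you posted.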

  • Is the issue happening because I am not able to pass the keytab? If yes, how do I pass a keytab when using Kafka and Spark Streaming over Kerberos? --keytab does not work, because we are already passing the keytab with --files: it is configured in jaas.conf and distributed with the --files parameter of spark-submit. Is there any other way the job can acquire a new ticket?

Yes. Spark allows for long-running applications, but on a secure system you must pass in a keytab.

Quoting Configuring Spark on YARN for Long-Running Applications with emphasis added:

Long-running applications such as Spark Streaming jobs must be able to write to HDFS, which means that the hdfs user may need to delegate tokens possibly beyond the default lifetime. This workload type REQUIRES passing Kerberos principal and keytab to the spark-submit script using the --principal and --keytab parameters. The keytab is copied to the host running the ApplicationMaster, and the Kerberos login is renewed periodically by using the principal and keytab to generate the required delegation tokens needed for HDFS.

Based on KAFKA-1696, this issue has not been resolved yet, so I'm not sure what you can do unless you're running CDH and can upgrade to Spark 2.1.
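
In practice the quoted guidance translates to a submit command along these lines (principal, keytab path and jar are placeholders):

    spark-submit \
      --master yarn --deploy-mode cluster \
      --principal user@domain \
      --keytab /path/to/user.keytab \
      my-streaming-job.jar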


0 votes

Posting here the solution that solved my problem, for the benefit of others. The solution was simply to provide --principal and --keytab pointing to a separate copy of the keytab file, so that there is no conflict.

Why could the job change status from RUNNING to ACCEPTED?

The application changed status because the Kerberos ticket was no longer valid. This can happen at any time after the lease has expired, but not at any deterministic point after expiry.

Is the issue happening because I am not able to pass the keytab?

It was indeed because of the keytab. There is an easy solution for this. A simple way to think about it: whenever HDFS access is required, a streaming job needs to pass a keytab and principal. Just make a copy of your keytab file and pass it with: --keytab "my-copy-yarn.keytab" --principal "user@domain". All other considerations (the jaas file etc.) stay the same, so you still need to apply those; this does not interfere with them.
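
Putting it together, the working submit command looks roughly like this (a sketch; file names, principal and jar are placeholders):

    # copy the keytab so the name given to --keytab differs from the one shipped via --files
    cp user.keytab my-copy-yarn.keytab

    spark-submit \
      --master yarn --deploy-mode cluster \
      --files "jaas.conf,user.keytab" \
      --keytab my-copy-yarn.keytab \
      --principal user@domain \
      --driver-java-options "-Djava.security.auth.login.config=./jaas.conf" \
      --conf "spark.executor.extraJavaOptions=-Djava.security.auth.login.config=./jaas.conf" \
      my-streaming-job.jar

Spark copies the --keytab file to the host running the ApplicationMaster, so it must not collide with the copy that --files distributes for jaas.conf to read.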

When the job again tries to go to the RUNNING state, YARN rejects it because it does not have a valid Kerberos ticket. Would it help if we ensured that the driver node always has a valid Kerberos ticket?

This is essentially happening because YARN tries to renew the ticket internally. Whether the node the application was launched from has a valid ticket at the time a new attempt is launched does not really matter. YARN has to have sufficient information to renew the ticket, and the application must have had a valid ticket when it was launched. The second part is always true, since without it the job would not even start; it is the first part (giving YARN the keytab and principal) that you need to take care of.