3 votes

I have a long-running Spark Streaming job which reads from Kafka. The job is started once and is expected to run forever.

The cluster is Kerberized.

What I have observed is that the job runs fine for a number of days (more than 7). At the start of the job we can see that it acquires an HDFS delegation token, which is valid for 7 days:

18/06/17 12:32:11 INFO hdfs.DFSClient: Created token for user: HDFS_DELEGATION_TOKEN owner=user@domain, renewer=yarn, realUser=, issueDate=1529213531903, maxDate=1529818331903, sequenceNumber=915336, masterKeyId=385 on ha-hdfs:cluster

The job keeps running for more than 7 days, but after that period (a few days after maxDate) the status suddenly changes to ACCEPTED, at an unpredictable time. After this it tries to acquire a new Kerberos ticket and fails with a Kerberos error:

18/06/26 01:17:40 INFO yarn.Client: Application report for application_xxxx_80353 (state: RUNNING)
18/06/26 01:17:41 INFO yarn.Client: Application report for application_xxxx_80353 (state: RUNNING)
18/06/26 01:17:42 INFO yarn.Client: Application report for application_xxxx_80353 (state: ACCEPTED)
18/06/26 01:17:42 INFO yarn.Client:
client token: Token { kind: YARN_CLIENT_TOKEN, service:}

Final exception:

18/06/26 01:17:45 WARN security.UserGroupInformation: PriviledgedActionException as:user@domain (auth:KERBEROS) cause:javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]

Note: I already tried passing a keytab file so that delegation could be renewed indefinitely, but I am not able to pass the keytab file to Spark because it conflicts with the Kafka jaas.conf.
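
For reference, the jaas.conf in play looks roughly like this (a sketch only; the keytab path, principal and service name are placeholders, not our real values):

    KafkaClient {
      com.sun.security.auth.module.Krb5LoginModule required
      useKeyTab=true
      storeKey=true
      keyTab="./user.keytab"
      principal="user@domain"
      serviceName="kafka";
    };

The keyTab path is relative because the file is shipped to each YARN container's working directory via --files.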

So there are 3 related questions:

  • Why could the job change status from RUNNING to ACCEPTED?
  • Is the issue happening because I am not able to pass the keytab? If yes, how do I pass a keytab when using Kafka and Spark Streaming over Kerberos? --keytab does not work, because we are already passing the keytab with --files: it is configured in jaas.conf and distributed with the --files parameter of spark-submit (see the sketch after this list). Is there any other way the job can acquire a new ticket?
  • When the job again tries to go to the RUNNING state, YARN rejects it because it does not have a valid Kerberos ticket. Would it help if we ensured that the driver node always has a valid Kerberos ticket? Then, when this happens, it would be like submitting a new Spark job; since that node has a valid Kerberos ticket, it would not give the Kerberos error.
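
The sketch referenced in the second question: the job is currently submitted roughly like this (file names and the jar are placeholders):

    spark-submit \
      --master yarn --deploy-mode cluster \
      --files "jaas.conf,user.keytab" \
      --driver-java-options "-Djava.security.auth.login.config=./jaas.conf" \
      --conf "spark.executor.extraJavaOptions=-Djava.security.auth.login.config=./jaas.conf" \
      my-streaming-job.jar

Adding --keytab user.keytab on top of this is what conflicts: the keytab would then be shipped twice under the same file name.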
What versions of Spark, Hadoop and Kafka are you running? Did you look at stackoverflow.com/questions/47977075/…? – tk421
We are using Spark 2.1 and are able to connect to Kafka; the problem is only with long-running jobs. Version info: Spark 2.1.0.cloudera1, Hadoop 2.6.0-cdh5.8.4, KAFKA-2.1.1 – reemas

2 Answers

1 vote
  • Why could the job change status from RUNNING to ACCEPTED?

A job will transition from RUNNING to ACCEPTED if the application attempt failed and you still have attempts left under your ApplicationMaster retry limit; YARN then re-accepts the application and schedules a new attempt.
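
The number of attempts is governed by the standard YARN/Spark settings; a sketch (values illustrative):

    <!-- yarn-site.xml: cluster-wide ceiling on AM attempts -->
    <property>
      <name>yarn.resourcemanager.am.max-attempts</name>
      <value>2</value>
    </property>

    # per-application override, capped by the cluster ceiling
    spark-submit --conf spark.yarn.maxAppAttempts=2 ...

While the new attempt is being scheduled, the application report shows ACCEPTED, which matches the log you posted.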

  • Is the issue happening because I am not able to pass the keytab? If yes, how do I pass a keytab when using Kafka and Spark Streaming over Kerberos? --keytab does not work, because we are already passing the keytab with --files: it is configured in jaas.conf and distributed with the --files parameter of spark-submit. Is there any other way the job can acquire a new ticket?

Yes. Spark allows for long-running applications, but on a secure system you must pass in a keytab.

Quoting Configuring Spark on YARN for Long-Running Applications with emphasis added:

Long-running applications such as Spark Streaming jobs must be able to write to HDFS, which means that the hdfs user may need to delegate tokens possibly beyond the default lifetime. This workload type REQUIRES passing Kerberos principal and keytab to the spark-submit script using the --principal and --keytab parameters. The keytab is copied to the host running the ApplicationMaster, and the Kerberos login is renewed periodically by using the principal and keytab to generate the required delegation tokens needed for HDFS.

Based on KAFKA-1696, this issue has not been resolved yet, so I'm not sure what you can do unless you're running CDH and can upgrade to Spark 2.1.
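
In practice the quoted guidance translates to a submit command along these lines (principal, keytab path and jar are placeholders):

    spark-submit \
      --master yarn --deploy-mode cluster \
      --principal user@domain \
      --keytab /path/to/user.keytab \
      my-streaming-job.jar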


0 votes

Posting here the solution that solved my problem, for the benefit of others. The solution was simply to provide --principal and --keytab pointing to a separate copy of the keytab file, so that there is no conflict.

Why could the job change status from RUNNING to ACCEPTED?

The application changed status because the Kerberos ticket was no longer valid. This can happen at any time after the lease has expired, but not at any deterministic point after expiry.

Is the issue happening because I am not able to pass the keytab?

It was indeed because of the keytab. There is an easy solution for this. A simple way to think about it: whenever HDFS access is required, a streaming job needs to pass a keytab and principal. Just make a copy of your keytab file and pass it with: --keytab "my-copy-yarn.keytab" --principal "user@domain". All other considerations (the jaas file etc.) stay the same, so you still need to apply those; this does not interfere with them.
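
Putting it together, the working submit command looks roughly like this (a sketch; file names, principal and jar are placeholders):

    # copy the keytab so the name given to --keytab differs from the one shipped via --files
    cp user.keytab my-copy-yarn.keytab

    spark-submit \
      --master yarn --deploy-mode cluster \
      --files "jaas.conf,user.keytab" \
      --keytab my-copy-yarn.keytab \
      --principal user@domain \
      --driver-java-options "-Djava.security.auth.login.config=./jaas.conf" \
      --conf "spark.executor.extraJavaOptions=-Djava.security.auth.login.config=./jaas.conf" \
      my-streaming-job.jar

Spark copies the --keytab file to the host running the ApplicationMaster, so it must not collide with the copy that --files distributes for jaas.conf to read.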

When the job again tries to go to the RUNNING state, YARN rejects it because it does not have a valid Kerberos ticket. Would it help if we ensured that the driver node always has a valid Kerberos ticket?

This is essentially happening because YARN tries to renew the ticket internally. Whether the node the application was launched from has a valid ticket at the time a new attempt is launched does not really matter. YARN has to have sufficient information to renew the ticket, and the application must have had a valid ticket when it was launched. The second part is always true, since without it the job would not even start; it is the first part (giving YARN the keytab and principal) that you need to take care of.