I have a long-running Spark Streaming job that reads from Kafka. The job is started once and is expected to run indefinitely.
The cluster is Kerberized.
The job runs fine for more than 7 days. At startup, the log shows that it acquires an HDFS delegation token that is valid for 7 days:
18/06/17 12:32:11 INFO hdfs.DFSClient: Created token for user: HDFS_DELEGATION_TOKEN owner=user@domain, renewer=yarn, realUser=, issueDate=1529213531903, maxDate=1529818331903, sequenceNumber=915336, masterKeyId=385 on ha-hdfs:cluster
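As a quick sanity check on the 7-day figure, the token lifetime can be computed from the issueDate and maxDate values in the log line above. This is only a minimal sketch; it assumes both fields are milliseconds since the Unix epoch, and the helper name ms_to_utc is mine, not something from Hadoop.

```python
from datetime import datetime, timedelta, timezone

# Values copied from the DFSClient log line above.
# Assumption: both are milliseconds since the Unix epoch.
ISSUE_DATE_MS = 1529213531903
MAX_DATE_MS = 1529818331903


def ms_to_utc(ms: int) -> datetime:
    """Convert epoch milliseconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(ms / 1000, tz=timezone.utc)


# Maximum lifetime of the token: maxDate minus issueDate.
lifetime = timedelta(milliseconds=MAX_DATE_MS - ISSUE_DATE_MS)

print("issued (UTC): ", ms_to_utc(ISSUE_DATE_MS).isoformat())
print("maxDate (UTC):", ms_to_utc(MAX_DATE_MS).isoformat())
print("max lifetime: ", lifetime)  # 7 days, 0:00:00
```

So the token cannot be renewed past 7 days after it was issued, which matches what I describe next.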
The job keeps running past the 7 days, but a few days after maxDate its state suddenly changes from RUNNING to ACCEPTED, at no predictable time. It then tries to acquire a new Kerberos ticket and fails with a Kerberos error:
18/06/26 01:17:40 INFO yarn.Client: Application report for application_xxxx_80353 (state: RUNNING)
18/06/26 01:17:41 INFO yarn.Client: Application report for application_xxxx_80353 (state: RUNNING)
18/06/26 01:17:42 INFO yarn.Client: Application report for application_xxxx_80353 (state: ACCEPTED)
18/06/26 01:17:42 INFO yarn.Client:
client token: Token { kind: YARN_CLIENT_TOKEN, service:}
The final exception is:
18/06/26 01:17:45 WARN security.UserGroupInformation: PriviledgedActionException as:user@domain (auth:KERBEROS) cause:javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
Note: I already tried passing a keytab file so that the delegation tokens could be renewed indefinitely, but I am not able to pass the keytab to Spark because it conflicts with the Kafka jaas.conf.
So I have three related questions:
- Why would the job change state from RUNNING to ACCEPTED?
- Is the issue happening because I am not able to pass the keytab? If so, how do I pass a keytab when using Kafka and Spark Streaming over Kerberos? --keytab does not work because we are already passing the keytab with --files: the keytab is configured in jaas.conf and distributed with the --files parameter of spark-submit. Is there any other way for the job to acquire a new ticket?
- When the job tries to return to the RUNNING state, YARN rejects it because it does not have a valid Kerberos ticket. Would it help to ensure that the driver node always has a valid Kerberos ticket? Then, when this happens, it would be like submitting a new Spark job: the node has a valid ticket, so there would be no Kerberos error.
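
To make the keytab question concrete, here is an untested sketch of one possible way around the --keytab / --files conflict: give Spark and Kafka two differently named copies of the same keytab. The Python below only assembles the command line and does not run it. --principal, --keytab, --files, --conf, --class, --master and --deploy-mode are real spark-submit options. Everything else is an assumption for illustration: every path, the principal, the class name, the jar, cluster deploy mode, and that jaas.conf refers to the copied keytab by its bare file name.

```python
import os
import shlex


def build_spark_submit(principal, spark_keytab, kafka_keytab_copy,
                       jaas_conf, main_class, app_jar):
    """Assemble (but do not run) a spark-submit command line for YARN."""
    # Conservative guard for this sketch: the keytab given to --keytab and
    # the copy shipped with --files must not share a file name.
    if os.path.basename(spark_keytab) == os.path.basename(kafka_keytab_copy):
        raise ValueError("use a differently named copy of the keytab for --files")

    # Files shipped with --files are expected in the container working
    # directory, so the JVM option refers to jaas.conf by a relative path.
    jaas_opt = "-Djava.security.auth.login.config=./" + os.path.basename(jaas_conf)

    return [
        "spark-submit",
        "--master", "yarn",
        "--deploy-mode", "cluster",    # assumption: cluster mode
        "--principal", principal,      # used by Spark to log in again
        "--keytab", spark_keytab,
        "--files", jaas_conf + "," + kafka_keytab_copy,  # used by the Kafka client
        "--conf", "spark.driver.extraJavaOptions=" + jaas_opt,
        "--conf", "spark.executor.extraJavaOptions=" + jaas_opt,
        "--class", main_class,
        app_jar,
    ]


# Hypothetical values, for illustration only.
cmd = build_spark_submit(
    principal="user@DOMAIN",
    spark_keytab="/path/to/user.keytab",
    kafka_keytab_copy="/path/to/user_kafka.keytab",
    jaas_conf="/path/to/jaas.conf",
    main_class="com.example.StreamingJob",
    app_jar="streaming-job.jar",
)
print(" ".join(shlex.quote(arg) for arg in cmd))
```

The intent is that the keytab given to --keytab lets Spark log in again and obtain fresh delegation tokens, while the copy listed in --files is the one the Kafka client reads through jaas.conf. Would something like this be the right approach?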