Yarn report flink job as FINISHED and SUCCEED when flink job failure

Question

I am running flink job on yarn, we use "fink run" in command line to submit our job to yarn, one day we had an exception on flink job, as we didn't enable the flink restart strategy so it simply failed, but eventually we found that the job status was "SUCCEED" from the yarn application list, which we expect to be "FAILED".

Flink CLI log:

06/12/2018 03:13:37 FlatMap (getTagStorageMapper.flatMap)(23/32) switched to CANCELED 
06/12/2018 03:13:37 GroupReduce (ResultReducer.reduceGroup)(31/32) switched to CANCELED 
06/12/2018 03:13:37 FlatMap (SubClassEDFJoinMapper.flatMap)(29/32) switched to CANCELED 
06/12/2018 03:13:37 CHAIN DataSource (SubClassInventory.AvroInputFormat.createInput) -> FlatMap (SubClassInventoryMapper.flatMap)(27/32) switched to CANCELED 
06/12/2018 03:13:37 GroupReduce (OutputReducer.reduceGroup)(28/32) switched to CANCELED 
06/12/2018 03:13:37 CHAIN DataSource (SubClassInventory.AvroInputFormat.createInput) -> FlatMap (BIMBQMInstrumentMapper.flatMap)(27/32) switched to CANCELED 
06/12/2018 03:13:37 GroupReduce (BIMBQMGovCorpReduce.reduceGroup)(30/32) switched to CANCELED 
06/12/2018 03:13:37 FlatMap (BIMBQMEVMJoinMapper.flatMap)(32/32) switched to CANCELED 
06/12/2018 03:13:37 Job execution switched to status FAILED.
No JobSubmissionResult returned, please make sure you called ExecutionEnvironment.execute()
2018-06-12 03:13:37,625 INFO  org.apache.flink.yarn.YarnClusterClient                       - Sending shutdown request to the Application Master
2018-06-12 03:13:37,625 INFO  org.apache.flink.yarn.YarnClusterClient                       - Start application client.
2018-06-12 03:13:37,630 INFO  org.apache.flink.yarn.ApplicationClient                       - Notification about new leader address akka.tcp://[email protected]:45663/user/jobmanager with session ID 00000000-0000-0000-0000-000000000000.
2018-06-12 03:13:37,632 INFO  org.apache.flink.yarn.ApplicationClient                       - Sending StopCluster request to JobManager.
2018-06-12 03:13:37,633 INFO  org.apache.flink.yarn.ApplicationClient                       - Received address of new leader akka.tcp://[email protected]:45663/user/jobmanager with session ID 00000000-0000-0000-0000-000000000000.
2018-06-12 03:13:37,634 INFO  org.apache.flink.yarn.ApplicationClient                       - Disconnect from JobManager null.
2018-06-12 03:13:37,635 INFO  org.apache.flink.yarn.ApplicationClient                       - Trying to register at JobManager akka.tcp://[email protected]:45663/user/jobmanager.
2018-06-12 03:13:37,688 INFO  org.apache.flink.yarn.ApplicationClient                       - Successfully registered at the ResourceManager using JobManager Actor[akka.tcp://[email protected]:45663/user/jobmanager#182802345]
2018-06-12 03:13:38,648 INFO  org.apache.flink.yarn.ApplicationClient                       - Sending StopCluster request to JobManager.
2018-06-12 03:13:39,480 INFO  org.apache.flink.yarn.YarnClusterClient                       - Application application_1528772982594_0001 finished with state FINISHED and final state SUCCEEDED at 1528773218662
2018-06-12 03:13:39,480 INFO  org.apache.flink.yarn.YarnClusterClient                       - YARN Client is shutting down
2018-06-12 03:13:39,582 INFO  org.apache.flink.yarn.ApplicationClient                       - Stopped Application client.
2018-06-12 03:13:39,583 INFO  org.apache.flink.yarn.ApplicationClient                       - Disconnect from JobManager Actor[akka.tcp://[email protected]:45663/user/jobmanager#182802345].

Flink job manager Log:

FlatMap (BIMBQMEVMJoinMapper.flatMap) (32/32) (67a002e07fe799c1624a471340c8cf9d) switched from CANCELING to CANCELED.
Try to restart or fail the job Flink Java Job at Tue Jun 12 03:13:17 UTC 2018 (1086cedb3617feeee8aace29a7fc6bd0) if no longer possible.
Requesting new TaskManager container with 8192 megabytes memory. Pending requests: 1
Job Flink Java Job at Tue Jun 12 03:13:17 UTC 2018 (1086cedb3617feeee8aace29a7fc6bd0) switched from state FAILING to FAILED.
Could not restart the job Flink Java Job at Tue Jun 12 03:13:17 UTC 2018 (1086cedb3617feeee8aace29a7fc6bd0) because the restart strategy prevented it.
Unregistered task manager ip-10-97-44-186/10.97.44.186. Number of registered task managers 31. Number of available slots 31
Stopping JobManager with final application status SUCCEEDED and diagnostics: Flink YARN Client requested shutdown
Shutting down cluster with status SUCCEEDED : Flink YARN Client requested shutdown
Unregistering application from the YARN Resource Manager
Waiting for application to be successfully unregistered.

Can anybody help me understand why does yarn say my flink job was "SUCCEED"?

Till Rohrmann Till Rohrmann · Accepted Answer · 2018-06-15T06:35:58

The reported application status in Yarn does not reflect the status of the executed job but the status of the Flink cluster since this is the Yarn application. Thus, the final status of the Yarn application only depends on whether the Flink cluster finished properly or not. Differently said, if a job fails, then it does not necessarily mean that the Flink cluster failed. These are two different things.

Yarn report flink job as FINISHED and SUCCEED when flink job failure

1 Answers