Yarn Automatically killing all Jobs exactly after 1 hour with no error

Question

Our yarn is killing all running jobs after exactly 1 hour. Doesn't matter if it is a spark or Sqoop job (mapreduce).

Looking for suggestions on the potential cause.

We are using HDP 2.5.x hadoop distribution on a 4 node cluster.

This is how I am running sqoop job

nohup sqoop-import -D mapred.task.timeout=0 --direct --connect jdbc:oracle:thin:@HOST:Port:DB --username USERNAME --password PASS --target-dir /prod/directory  --table TABLE_NAME --verbose -m 25 --split-by TABLE_NAME.COLUMN --as-parquetfile --fields-terminated-by "\t" > temp.log 2>&1 &

All it says is as below

16/11/26 01:40:49 INFO mapreduce.Job:  map 42% reduce 0%
16/11/26 01:41:44 INFO mapreduce.Job:  map 0% reduce 0%
16/11/26 01:41:44 INFO mapreduce.Job: Job job_1480141487938_0001 failed with state KILLED due to: Application killed by user.
16/11/26 01:41:44 INFO mapreduce.Job: Counters: 0
16/11/26 01:41:44 WARN mapreduce.Counters: Group FileSystemCounters is deprecated. Use org.apache.hadoop.mapreduce.FileSystemCounter instead
16/11/26 01:41:44 INFO mapreduce.ImportJobBase: Transferred 0 bytes in 3,628.6498 seconds (0 bytes/sec)
16/11/26 01:41:44 WARN mapreduce.Counters: Group org.apache.hadoop.mapred.Task$Counter is deprecated. Use org.apache.hadoop.mapreduce.TaskCounter instead
16/11/26 01:41:44 INFO mapreduce.ImportJobBase: Retrieved 0 records.
16/11/26 01:41:44 DEBUG util.ClassLoaderStack: Restoring classloader: sun.misc.Launcher$AppClassLoader@131276c2
16/11/26 01:41:44 ERROR tool.ImportTool: Error during import: Import job failed!

Yarn application log

yarn logs -applicationId application_1480141487938_0001|grep -B2 -A10 "ERROR "
16/11/26 03:05:39 INFO impl.TimelineClientImpl: Timeline service address: http://HostName:8188/ws/v1/timeline/
16/11/26 03:05:39 INFO client.RMProxy: Connecting to ResourceManager at HostName/HostIp:8050
16/11/26 03:05:39 INFO client.AHSProxy: Connecting to Application History server at HostName/HostIp:10200
16/11/26 03:05:40 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
16/11/26 03:05:40 INFO compress.CodecPool: Got brand-new decompressor [.deflate]
2016-11-26 00:41:33,284 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerRequestor: getResources() for application_1480141487938_0001: ask=1 release= 2 newContainers=0 finishedContainers=2 resourcelimit=<memory:20480, vCores:1> knownNMs=4
2016-11-26 00:41:33,285 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Received completed container container_e09_1480141487938_0001_01_000028
2016-11-26 00:41:33,285 ERROR [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Container complete event for unknown container id container_e09_1480141487938_0001_01_000028
2016-11-26 00:41:33,285 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Received completed container container_e09_1480141487938_0001_01_000029
2016-11-26 00:41:33,285 ERROR [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Container complete event for unknown container id container_e09_1480141487938_0001_01_000029
2016-11-26 00:41:33,686 INFO [Socket Reader #1 for port 41553] SecurityLogger.org.apache.hadoop.ipc.Server: Auth successful for job_1480141487938_0001 (auth:SIMPLE)
2016-11-26 00:41:33,697 INFO [IPC Server handler 6 on 41553] org.apache.hadoop.mapred.TaskAttemptListenerImpl: JVM with ID : jvm_1480141487938_0001_m_9895604650011 asked for a task
2016-11-26 00:41:33,698 INFO [IPC Server handler 6 on 41553] org.apache.hadoop.mapred.TaskAttemptListenerImpl: JVM with ID: jvm_1480141487938_0001_m_9895604650011 given task: attempt_1480141487938_0001_m_000024_0
2016-11-26 00:41:37,542 INFO [IPC Server handler 19 on 41553] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Progress of TaskAttempt attempt_1480141487938_0001_m_000000_0 is : 0.0
2016-11-26 00:41:38,793 INFO [IPC Server handler 22 on 41553] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Progress of TaskAttempt attempt_1480141487938_0001_m_000001_0 is : 0.0
2016-11-26 00:41:38,811 INFO [IPC Server handler 23 on 41553] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Progress of TaskAttempt attempt_1480141487938_0001_m_000006_0 is : 0.0
2016-11-26 00:41:38,939 INFO [IPC Server handler 28 on 41553] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Progress of TaskAttempt attempt_1480141487938_0001_m_000007_0 is : 0.0
2016-11-26 00:41:40,568 INFO [IPC Server handler 22 on 41553] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Progress of TaskAttempt attempt_1480141487938_0001_m_000000_0 is : 0.0
2016-11-26 00:41:41,812 INFO [IPC Server handler 24 on 41553] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Progress of TaskAttempt attempt_1480141487938_0001_m_000001_0 is : 0.0
2016-11-26 00:41:41,832 INFO [IPC Server handler 25 on 41553] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Progress of TaskAttempt attempt_1480141487938_0001_m_000006_0 is : 0.0

Rm Audit Log

2016-11-26 01:41:43,359 INFO resourcemanager.RMAuditLogger: USER=yarn   IP=HostIp   OPERATION=Kill Application Request  TARGET=ClientRMService  RESULT=SUCCESS  APPID=application_1480141487938_0001    CALLERCONTEXT=CLI

I have already modified every value i can find in Ambari from 3600 to some larger value, restarted cluster and re-ran script. Still exactly job gets killed after 1 hour both for sqoop and spark jobs.

Edit:

yarn logs -show_application_log_info -applicationId application_1480141487938_0001

shows only container id's from 1 to 27. So, where can i find log/error for container 28 and 29?

Did you find out the reason for this ? Happening to us on CDH 5.5 cluster. — morfious902002
We finally go tired of messing up with iptables and decided to just shut it off, and harden complete access to cluster through a gateway node. It is something related to network which causes this, maybe some protocol or some port. — Abhishek Anand
For us it was Livy Server's session timeout which was killing the Spark Batch job. It is a know bug which will be fixed in their next release. — morfious902002
@morfious902002 , good to know. Who knows we might land in that mess in future. ;) — Abhishek Anand

Abhishek Anand Abhishek Anand · Accepted Answer · 2016-12-29T17:11:36

We were never able to isolate the issue completely, only that it was network related. Turns out even when I increased all the possible parameters from 3600 to more, on client/node side some kind of heartbeat was set for 3600 second and was not getting updated.

So, basically after almost an hour the heartbeat would try to communicate, fail and AM will kill complete job.

Since, hadoop, Hortonworks and Cloudera's documents really lack each specific port and protocol specification that needs/should be enabled on every version, we finally had to turn iptables off to resolve this.

Yarn Automatically killing all Jobs exactly after 1 hour with no error

1 Answers