I am having trouble with my Hadoop cluster. I tried to run some benchmarks on it to check its performance and to see whether MapReduce works correctly, but I got some strange behaviour. MapReduce starts and gets partway through its map phase, but then task attempts fail. I first used teragen to create the data:
$ hadoop jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar teragen 500 random-data
Then the job starts, and I get task failures while the job keeps running:
17/02/23 12:29:27 INFO client.RMProxy: Connecting to ResourceManager at /172.16.138.145:8032
17/02/23 12:29:28 INFO terasort.TeraSort: Generating 500 using 2
17/02/23 12:29:28 INFO mapreduce.JobSubmitter: number of splits:2
17/02/23 12:29:28 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1487846108320_0007
17/02/23 12:29:28 INFO impl.YarnClientImpl: Submitted application application_1487846108320_0007
17/02/23 12:29:28 INFO mapreduce.Job: The url to track the job: http://172.16.138.145:8088/proxy/application_1487846108320_0007/
17/02/23 12:29:28 INFO mapreduce.Job: Running job: job_1487846108320_0007
17/02/23 12:29:34 INFO mapreduce.Job: Job job_1487846108320_0007 running in uber mode : false
17/02/23 12:29:34 INFO mapreduce.Job: map 0% reduce 0%
17/02/23 12:29:47 INFO mapreduce.Job: Task Id : attempt_1487846108320_0007_m_000001_0, Status : FAILED
17/02/23 12:29:48 INFO mapreduce.Job: Task Id : attempt_1487846108320_0007_m_000000_0, Status : FAILED
17/02/23 12:30:02 INFO mapreduce.Job: map 50% reduce 0%
17/02/23 12:30:02 INFO mapreduce.Job: Task Id : attempt_1487846108320_0007_m_000001_1, Status : FAILED
17/02/23 12:30:03 INFO mapreduce.Job: map 0% reduce 0%
17/02/23 12:30:03 INFO mapreduce.Job: Task Id : attempt_1487846108320_0007_m_000000_1, Status : FAILED
17/02/23 12:30:15 INFO mapreduce.Job: Task Id : attempt_1487846108320_0007_m_000001_2, Status : FAILED
17/02/23 12:30:16 INFO mapreduce.Job: Task Id : attempt_1487846108320_0007_m_000000_2, Status : FAILED
17/02/23 12:30:30 INFO mapreduce.Job: map 100% reduce 0%
17/02/23 12:30:31 INFO mapreduce.Job: Job job_1487846108320_0007 failed with state FAILED due to: Task failed task_1487846108320_0007_m_000001
Job failed as tasks failed. failedMaps:1 failedReduces:0
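To make the pattern in the output above easier to see, here is a minimal sketch that counts the failed attempts per task. It assumes the client output has been saved to a file (job.log is a hypothetical name), and count_failed_attempts is only an illustrative helper, not part of Hadoop.

```shell
# Summarise failed attempts per task from saved job client output.
# Usage: count_failed_attempts job.log   (job.log is a hypothetical file name)
count_failed_attempts() {
    # Attempt IDs end in _<attempt number>; stripping that suffix
    # leaves one key per task, which is then counted.
    grep 'Status : FAILED' "$1" \
        | sed -n 's/.*\(attempt_[0-9]*_[0-9]*_[mr]_[0-9]*\)_[0-9]*,.*/\1/p' \
        | sort | uniq -c \
        | awk '{ print $2, $1 }'
}
```

On the output above, this reports three failed attempts for each of the two map tasks (m_000000 and m_000001), so both mappers are affected, not just one.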
I checked the logs on the affected datanode and found the following lines repeated for each failure:
2017-02-23 11:36:12,901 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: attempt_1487846108320_0001_m_000001_1 TaskAttempt Transitioned from RUNNING to FAIL_CONTAINER_CLEANUP
2017-02-23 11:36:12,901 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics report from attempt_1487846108320_0001_m_000001_1:
2017-02-23 11:36:12,902 INFO [ContainerLauncher #5] org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl: Processing the event EventType: CONTAINER_REMOTE_CLEANUP for container container_1487846108320_0001_01_000004 taskAttempt attempt_1487846108320_0001_m_000001_1
2017-02-23 11:36:12,903 INFO [ContainerLauncher #5] org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl: KILLING attempt_1487846108320_0001_m_000001_1
2017-02-23 11:36:12,903 INFO [ContainerLauncher #5] org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy: Opening proxy : Datanode3:34121
2017-02-23 11:36:12,923 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: attempt_1487846108320_0001_m_000001_1 TaskAttempt Transitioned from FAIL_CONTAINER_CLEANUP to FAIL_TASK_CLEANUP
2017-02-23 11:36:12,924 INFO [CommitterEvent Processor #2] org.apache.hadoop.mapreduce.v2.app.commit.CommitterEventHandler: Processing the event EventType: TASK_ABORT
2017-02-23 11:36:12,932 WARN [CommitterEvent Processor #2] org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter: Could not delete hdfs://172.16.138.145:9000/user/hdfs/random-dataSmallV7.7/_temporary/1/_temporary/attempt_1487846108320_0001_m_000001_1
2017-02-23 11:36:12,932 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: attempt_1487846108320_0001_m_000001_1 TaskAttempt Transitioned from FAIL_TASK_CLEANUP to FAILED
In this case the job failed, but sometimes (rarely) I get the same errors and the job still succeeds. Do you know what could cause this FAIL_CONTAINER_CLEANUP, or what the potential causes of this problem are? This job uses only mappers and no reducers, but the error also happens in other jobs where reducers are involved.
Thank you in advance for your ideas.