3
votes

I have been using Hadoop for the last week or so (trying to get to grips with it), and although I have been able to set up a multinode cluster (2 machines: 1 laptop and a small desktop) and retrieve results, I always seem to encounter "Too many fetch failures" when I run a hadoop job.

An example output (on a trivial wordcount example) is:

hadoop@ap200:/usr/local/hadoop$ bin/hadoop jar hadoop-examples-0.20.203.0.jar wordcount sita sita-output3X
11/05/20 15:02:05 INFO input.FileInputFormat: Total input paths to process : 7
11/05/20 15:02:05 INFO mapred.JobClient: Running job: job_201105201500_0001
11/05/20 15:02:06 INFO mapred.JobClient:  map 0% reduce 0%
11/05/20 15:02:23 INFO mapred.JobClient:  map 28% reduce 0%
11/05/20 15:02:26 INFO mapred.JobClient:  map 42% reduce 0%
11/05/20 15:02:29 INFO mapred.JobClient:  map 57% reduce 0%
11/05/20 15:02:32 INFO mapred.JobClient:  map 100% reduce 0%
11/05/20 15:02:41 INFO mapred.JobClient:  map 100% reduce 9%
11/05/20 15:02:49 INFO mapred.JobClient: Task Id :      attempt_201105201500_0001_m_000003_0, Status : FAILED
Too many fetch-failures
11/05/20 15:02:53 INFO mapred.JobClient:  map 85% reduce 9%
11/05/20 15:02:57 INFO mapred.JobClient:  map 100% reduce 9%
11/05/20 15:03:10 INFO mapred.JobClient: Task Id : attempt_201105201500_0001_m_000002_0, Status : FAILED
Too many fetch-failures
11/05/20 15:03:14 INFO mapred.JobClient:  map 85% reduce 9%
11/05/20 15:03:17 INFO mapred.JobClient:  map 100% reduce 9%
11/05/20 15:03:25 INFO mapred.JobClient: Task Id : attempt_201105201500_0001_m_000006_0, Status : FAILED
Too many fetch-failures
11/05/20 15:03:29 INFO mapred.JobClient:  map 85% reduce 9%
11/05/20 15:03:32 INFO mapred.JobClient:  map 100% reduce 9%
11/05/20 15:03:35 INFO mapred.JobClient:  map 100% reduce 28%
11/05/20 15:03:41 INFO mapred.JobClient:  map 100% reduce 100%
11/05/20 15:03:46 INFO mapred.JobClient: Job complete: job_201105201500_0001
11/05/20 15:03:46 INFO mapred.JobClient: Counters: 25
11/05/20 15:03:46 INFO mapred.JobClient:   Job Counters 
11/05/20 15:03:46 INFO mapred.JobClient:     Launched reduce tasks=1
11/05/20 15:03:46 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=72909
11/05/20 15:03:46 INFO mapred.JobClient:     Total time spent by all reduces waiting  after reserving slots (ms)=0
11/05/20 15:03:46 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
11/05/20 15:03:46 INFO mapred.JobClient:     Launched map tasks=10
11/05/20 15:03:46 INFO mapred.JobClient:     Data-local map tasks=10
11/05/20 15:03:46 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=76116
11/05/20 15:03:46 INFO mapred.JobClient:   File Output Format Counters 
11/05/20 15:03:46 INFO mapred.JobClient:     Bytes Written=1412473
11/05/20 15:03:46 INFO mapred.JobClient:   FileSystemCounters
11/05/20 15:03:46 INFO mapred.JobClient:     FILE_BYTES_READ=4462381
11/05/20 15:03:46 INFO mapred.JobClient:     HDFS_BYTES_READ=6950740
11/05/20 15:03:46 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=7546513
11/05/20 15:03:46 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=1412473
11/05/20 15:03:46 INFO mapred.JobClient:   File Input Format Counters 
11/05/20 15:03:46 INFO mapred.JobClient:     Bytes Read=6949956
11/05/20 15:03:46 INFO mapred.JobClient:   Map-Reduce Framework
11/05/20 15:03:46 INFO mapred.JobClient:     Reduce input groups=128510
11/05/20 15:03:46 INFO mapred.JobClient:     Map output materialized bytes=2914947
11/05/20 15:03:46 INFO mapred.JobClient:     Combine output records=201001
11/05/20 15:03:46 INFO mapred.JobClient:     Map input records=137146
11/05/20 15:03:46 INFO mapred.JobClient:     Reduce shuffle bytes=2914947
11/05/20 15:03:46 INFO mapred.JobClient:     Reduce output records=128510
11/05/20 15:03:46 INFO mapred.JobClient:     Spilled Records=507835
11/05/20 15:03:46 INFO mapred.JobClient:     Map output bytes=11435785
11/05/20 15:03:46 INFO mapred.JobClient:     Combine input records=1174986
11/05/20 15:03:46 INFO mapred.JobClient:     Map output records=1174986
11/05/20 15:03:46 INFO mapred.JobClient:     SPLIT_RAW_BYTES=784
11/05/20 15:03:46 INFO mapred.JobClient:     Reduce input records=201001

I googled the problem, and the people at Apache seem to suggest it could be anything from a networking problem (or something to do with the /etc/hosts files) to a corrupt disk on the slave nodes.

Just to add: I do see 2 "live nodes" on the namenode admin panel (localhost:50070/dfshealth), and under the Map/Reduce admin page I see 2 nodes as well.

Any clues as to how I can avoid these errors? Thanks in advance.

Edit 1:

The tasktracker log is at http://pastebin.com/XMkNBJTh and the datanode log is at http://pastebin.com/ttjR7AYZ

Many thanks.

3
What are the exact stack traces? Please post your task logs. - Thomas Jungblut
Thanks Thomas for your reply. I have pasted the logs as above. - John M
The datanode seems fine, but the tasktracker has serious problems. Did you check the disk with hdparm? Do you have networking problems at all? - Thomas Jungblut
Hi Thomas, thank you for your reply. May I ask why you say the tasktracker has serious problems? No, I have not checked the disk with hdparm. I do seem to have some networking problem, but I am not able to pinpoint exactly where it occurs. What is also strange is that I get exactly 3 fetch failures on every run, which I find weird. - John M
Are they always on the same host? If so, you should check the network driver and your hard disk. - Thomas Jungblut

3 Answers

2
votes

Modify the /etc/hosts file on the datanode.

Each line is divided into three parts: the first is the network IP address, the second is the host name or domain name, and the third is the host alias. The detailed steps are as follows:

  1. First check the host name:

    cat /proc/sys/kernel/hostname

    You will see a HOSTNAME attribute. Change the value after it to the IP address, then save and exit.

  2. Use the command:

    hostname ***.***.***.***

    Replace the asterisks with the corresponding IP address.

  3. Modify the hosts configuration similarly, as follows:

    127.0.0.1 localhost.localdomain localhost
    ::1 localhost6.localdomain6 localhost6
    10.200.187.77 10.200.187.77 hadoop-datanode

Once the IP address has been set successfully, or if the host name still displays incorrectly, continue adjusting the hosts file.
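To check what a hosts file actually maps a node name to after edits like the ones above, something along these lines may help. It is only a sketch, assuming a POSIX shell with awk and grep; the function names `hosts_lookup` and `maps_to_loopback` are made up for illustration and are not part of Hadoop.

```shell
# Print every IP address that a hosts-format file maps to the given name.
# Usage: hosts_lookup FILE NAME
hosts_lookup() {
  awk -v name="$2" '
    { sub(/#.*/, "") }               # strip comments before matching
    NF >= 2 {
      for (i = 2; i <= NF; i++)      # fields 2..NF are the host name and aliases
        if ($i == name) print $1
    }
  ' "$1"
}

# Succeed (exit status 0) if NAME maps to a loopback address (127.x.x.x) in FILE.
# Usage: maps_to_loopback FILE NAME
maps_to_loopback() {
  hosts_lookup "$1" "$2" | grep -q '^127\.'
}
```

For example, `hosts_lookup /etc/hosts "$(hostname)"` run on each node shows which address that node's own name resolves to through the hosts file. If `maps_to_loopback /etc/hosts "$(hostname)"` succeeds, the node's name points at a loopback address, which other machines cannot reach.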

1
votes

The following solution will definitely work:

1. Remove or comment out the lines with IP 127.0.0.1 and 127.0.1.1.

2. Use the host name, not an alias, to refer to each node in the hosts file and in the masters/slaves files in the Hadoop directory:

  --> in the hosts file: 172.21.3.67 master-ubuntu

  --> in the masters/slaves file: master-ubuntu

3. Check that the namespaceID of the namenode equals the namespaceID of the datanode.
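The namespaceID comparison in step 3 can be done by reading the `namespaceID=` line from each storage directory's `current/VERSION` file. A minimal sketch, assuming a POSIX shell with sed; the function names are hypothetical, and the real VERSION paths depend on your `dfs.name.dir` and `dfs.data.dir` settings:

```shell
# Print the namespaceID recorded in a Hadoop storage VERSION file.
# Usage: namespace_id VERSION_FILE
namespace_id() {
  sed -n 's/^namespaceID=//p' "$1"
}

# Succeed (exit status 0) if both VERSION files carry the same, non-empty namespaceID.
# Usage: same_namespace NAMENODE_VERSION DATANODE_VERSION
same_namespace() {
  a=$(namespace_id "$1")
  b=$(namespace_id "$2")
  [ -n "$a" ] && [ "$a" = "$b" ]
}
```

Usage would look like `same_namespace <dfs.name.dir>/current/VERSION <dfs.data.dir>/current/VERSION`, with both placeholders replaced by the directories from your own configuration. The datanode's file usually lives on a different machine, so it has to be copied over or the two values compared by hand.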

0
votes

I had the same problem: "Too many fetch failures" and very slow Hadoop performance (the simple wordcount example took more than 20 minutes to run on a 2-node cluster of powerful servers). I also got "WARN mapred.JobClient: Error reading task outputConnection refused" errors.

The problem was fixed when I followed the instructions from Thomas Jungblut: I removed my master node from the slaves configuration file. After this, the errors disappeared and the wordcount example took only 1 minute.
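As a rough illustration of that change, the helper below prints a slaves file with one host name filtered out, so the result can be reviewed before replacing `conf/slaves`. This is a sketch assuming a POSIX shell with grep; `slaves_without` is a made-up name, and `master-ubuntu` is borrowed from the other answer purely as an example host.

```shell
# Print the contents of a slaves file with every line that exactly equals HOST removed.
# Usage: slaves_without FILE HOST
slaves_without() {
  # -v inverts the match, -x requires a whole-line match, -F treats HOST as a literal string
  grep -vxF -- "$2" "$1"
}
```

For example, `slaves_without conf/slaves master-ubuntu > conf/slaves.new` writes the filtered list to a new file. Keep in mind that a host removed from the slaves file no longer gets a datanode or tasktracker from the start scripts, so on a 2-node cluster only one worker remains.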