
I set up a 2-node Hadoop cluster on AWS, with the namenode and the jobtracker running on the master, and a tasktracker and datanode running on both the master and the slave. When I start DFS, it tells me that it starts the namenode, the datanode on both nodes, and the secondary namenode. When I start MapReduce, it also tells me that the jobtracker was started, as well as the tasktracker on both nodes. I ran an example to make sure everything was working, but the namenode web interface showed that only one tasktracker was being used. I checked the logs, and both the datanode and tasktracker logs on the slave had something along the lines of

2013-08-08 21:31:04,196 INFO org.apache.hadoop.ipc.RPC: Server at ip-10-xxx-xxx-xxx/10.xxx.xxx.xxx:9000 not available yet, Zzzzz...
2013-08-08 21:31:06,202 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: ip-10-xxx-xxx-xxx/10.xxx.xxx.xxx:9000. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)

The namenode is running on port 9000; the lines above are from the datanode log. The tasktracker log had the same thing, except with port 9001, which is where the jobtracker is running. I found a page on the Apache wiki about this error, http://wiki.apache.org/hadoop/ServerNotAvailable, but none of the possible causes it lists seem to apply. Since I'm running both nodes on AWS, I also made sure that both ports are permitted.
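For reference, a quick way to check which interface the namenode is actually bound to on the master, and whether the slave can reach it at all (this assumes netstat and telnet are available on the instances):

sudo netstat -tlnp | grep 9000     # on the master: shows the address the namenode is listening on
telnet 10.xxx.xxx.xxx 9000         # from the slave: tests raw connectivity to the master's private IP

If netstat shows 127.0.0.1:9000 rather than the master's private IP (or 0.0.0.0), the slave's daemons will never be able to connect, regardless of what AWS permits.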

In summary:

The tasktracker and datanode on the slave node won't connect to the master

I know the IP addresses are right; I've checked multiple times

I can ssh without a passphrase from each instance into the other and into itself

Both ports are open in the AWS security group

Based on the logs, both the namenode and the jobtracker are running fine

I put the IPs of the master and slave in the config files rather than hostnames, because when I did use hostnames and edited /etc/hosts accordingly, they couldn't be resolved (example entries of the kind I mean are shown below)
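The /etc/hosts entries I'm referring to look roughly like this on both machines (addresses masked, so treat them as illustrative):

10.xxx.xxx.1    ip-10-xxx-xxx-1
10.xxx.xxx.2    ip-10-xxx-xxx-2

where each line maps an instance's private IP to the hostname used in the config files.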

Does anybody know of any other possible reasons?

I think that's the same reason you are getting that error. Try mapping like this in /etc/hosts on each machine: e.g. 10.x.x.m ip-10-x-x-m and 10.x.x.n ip-10-x-x-n, where m and n are your 1st and 2nd machines' IP addresses. Both entries should go into both machines' /etc/hosts. Then try pinging each machine from the other using the hostname rather than the IP address. If that works, everything should work perfectly. – SSaikia_JtheRocker
I can ping using the hostname, but I'm still getting the same error in the logs. – Amre
Next thought: you can also try hostnames instead of IP addresses in the masters file and slaves file on each machine. And make sure that in each masters file you have only the master node you are planning to use. It would help if you could post your masters and slaves files here. – SSaikia_JtheRocker
OK, so apparently it's because the namenode is listening at 127.0.0.1:9000 rather than ip-10.x.x.IpOfMaster:9000: stackoverflow.com/questions/8872807/… – Amre
I just replaced localhost:9000 in the config files with ip-10.x.x.x:9000 and it worked. – Amre

1 Answer


Per the original poster:

OK, so apparently, it's because the namenode is listening at 127.0.0.1:9000, rather than ip-10.x.x.IpOfMaster:9000. See Hadoop Datanodes cannot find NameNode. I just replaced localhost:9000 in the config files with ip-10.x.x.x:9000 and it worked.
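For a Hadoop 1.x setup like this one, the change amounts to something along these lines in core-site.xml and mapred-site.xml on the master (the IP is masked here to match the poster's notation, so treat the values as illustrative):

<!-- core-site.xml -->
<property>
  <name>fs.default.name</name>
  <!-- was hdfs://localhost:9000, which binds the namenode to 127.0.0.1 -->
  <value>hdfs://ip-10.x.x.x:9000</value>
</property>

<!-- mapred-site.xml -->
<property>
  <name>mapred.job.tracker</name>
  <!-- was localhost:9001, which binds the jobtracker to 127.0.0.1 -->
  <value>ip-10.x.x.x:9001</value>
</property>

After changing the values, restart DFS and MapReduce so the daemons pick up the new addresses; the slave's datanode and tasktracker should then be able to register with the master.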