9
votes

I have 3 DataNodes and 1 NameNode on a machine inside LXC containers. The DataNode on the same node as the NameNode works fine but the other 2 don't i get:

 Initialization failed for Block pool BP-232943349-10.0.3.112-1417116665984 
(Datanode Uuid null) service to hadoop12.domain.local/10.0.3.112:8022 
Datanode denied communication with namenode because hostname cannot be resolved 
(ip=10.0.3.233, hostname=10.0.3.233): DatanodeRegistration(10.0.3.114, 
datanodeUuid=49a6dc47-c988-4cb8-bd84-9fabf87807bf, infoPort=50075, ipcPort=50020, 
storageInfo=lv=-56;cid=cluster24;nsid=11020533;c=0)

in the log file note that my NameNode is at IP 10.0.3.112, and the DataNode failing is at 10.0.3.114 in this case. All nodes FQDNs are defined in the hosts file on all nodes, and I can ping each node from all others.

What puzzles me here is that the DataNode is trying to locate the NameNode at 10.0.3.233 which is NOT an IP in the list nor the IP of the NameNode Why? where is this setup? The second DataNode that fails is at 10.0.3.113 and also looks for a different IP (10.0.3.158) that it can't resolve because it's not defined and does not exist in my setup.

The node that works is at 10.0.3.112 like the NameNode, yet in the log I see it is working with src/ and dst/ files that are IPs out of the range I use. like this:

    src: /10.0.3.112:50010, dest: /10.0.3.180:53246, bytes: 60, op: HDFS_READ, 
cliID: DFSClient_NONMAPREDUCE_-939581249_2253, offset: 0, srvID: a83af9ba-4e1a-47b3-a5d4-
f437ef60c287, blockid: BP-232943349-10.0.3.112-1417116665984:blk_1073742468_1644, 
duration: 1685666

so what exactly is going on here, and how comes I can't reach the NameNode when all my nodes see and resolve each other?

Thanks for help

PS: the /etc/hosts file looks like this:

127.0.0.1   localhost

# The following lines are desirable for IPv6 capable hosts
::1     localhost ip6-localhost ip6-loopback
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters

10.0.3.1    bigdata.domain.local
192.168.10.33   bigdata.domain.local
10.0.3.111  hadoop11.domain.local
10.0.3.112  hadoop12.domain.local
10.0.3.113  hadoop13.domain.local
10.0.3.114  hadoop14.domain.local
10.0.3.115  hadoop15.domain.local
10.0.3.116  hadoop16.domain.local
10.0.3.117  hadoop17.domain.local
10.0.3.118  hadoop18.domain.local
10.0.3.119  hadoop19.domain.local
10.0.3.121  hadoop21.domain.local
10.0.3.122  hadoop22.domain.local
10.0.3.123  hadoop23.domain.local
10.0.3.124  hadoop24.domain.local
10.0.3.125  hadoop25.domain.local
10.0.3.126  hadoop26.domain.local
10.0.3.127  hadoop27.domain.local
10.0.3.128  hadoop28.domain.local
10.0.3.129  hadoop29.domain.local

core-site.xml:

<?xml version="1.0" encoding="UTF-8"?>

<!--Autogenerated by Cloudera Manager-->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://nameservice1</value>
  </property>
  <property>
    <name>fs.trash.interval</name>
    <value>1</value>
  </property>
  <property>
    <name>io.compression.codecs</name>
        <value>org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.BZip2Codec,org.apache.hadoop.io.compress.DeflateCodec,org.apache.hadoop.io.compress.SnappyCodec,org.apache.hadoop.io.compress.Lz4Codec</value>
      </property>
      <property>
        <name>hadoop.security.authentication</name>
        <value>simple</value>
      </property>
      <property>
        <name>hadoop.security.authorization</name>
        <value>false</value>
      </property>
      <property>
        <name>hadoop.rpc.protection</name>
        <value>authentication</value>
      </property>
      <property>
        <name>hadoop.ssl.require.client.cert</name>
        <value>false</value>
        <final>true</final>
      </property>
      <property>
        <name>hadoop.ssl.keystores.factory.class</name>
        <value>org.apache.hadoop.security.ssl.FileBasedKeyStoresFactory</value>
        <final>true</final>
      </property>
      <property>
        <name>hadoop.ssl.server.conf</name>
        <value>ssl-server.xml</value>
        <final>true</final>
      </property>
      <property>
        <name>hadoop.ssl.client.conf</name>
        <value>ssl-client.xml</value>
        <final>true</final>
      </property>
      <property>
        <name>hadoop.security.auth_to_local</name>
        <value>DEFAULT</value>
      </property>
      <property>
        <name>hadoop.proxyuser.oozie.hosts</name>
        <value>*</value>
      </property>
      <property>
        <name>hadoop.proxyuser.oozie.groups</name>
        <value>*</value>
      </property>
      <property>
        <name>hadoop.proxyuser.mapred.hosts</name>
        <value>*</value>
      </property>
      <property>
        <name>hadoop.proxyuser.mapred.groups</name>
        <value>*</value>
      </property>
      <property>
        <name>hadoop.proxyuser.flume.hosts</name>
        <value>*</value>
      </property>
      <property>
        <name>hadoop.proxyuser.flume.groups</name>
        <value>*</value>
      </property>
      <property>
        <name>hadoop.proxyuser.HTTP.hosts</name>
        <value>*</value>
      </property>
      <property>
        <name>hadoop.proxyuser.HTTP.groups</name>
        <value>*</value>
      </property>
      <property>
        <name>hadoop.proxyuser.hive.hosts</name>
        <value>*</value>
      </property>
      <property>
        <name>hadoop.proxyuser.hive.groups</name>
        <value>*</value>
      </property>
      <property>
        <name>hadoop.proxyuser.hue.hosts</name>
        <value>*</value>
      </property>
      <property>
        <name>hadoop.proxyuser.hue.groups</name>
        <value>*</value>
      </property>
      <property>
        <name>hadoop.proxyuser.httpfs.hosts</name>
        <value>*</value>
      </property>
      <property>
        <name>hadoop.proxyuser.httpfs.groups</name>
        <value>*</value>
      </property>
      <property>
        <name>hadoop.proxyuser.hdfs.groups</name>
        <value>*</value>
      </property>
      <property>
        <name>hadoop.proxyuser.hdfs.hosts</name>
        <value>*</value>
      </property>
      <property>
        <name>hadoop.security.group.mapping</name>
        <value>org.apache.hadoop.security.ShellBasedUnixGroupsMapping</value>
      </property>
      <property>
        <name>hadoop.security.instrumentation.requires.admin</name>
        <value>false</value>
      </property>
    </configuration>

hdfs-site.xml

<

!--Autogenerated by Cloudera Manager-->
<configuration>
  <property>
    <name>dfs.nameservices</name>
    <value>nameservice1</value>
  </property>
  <property>
    <name>dfs.client.failover.proxy.provider.nameservice1</name>
    <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
  </property>
  <property>
    <name>dfs.ha.automatic-failover.enabled.nameservice1</name>
    <value>true</value>
  </property>
  <property>
    <name>ha.zookeeper.quorum</name>
    <value>hadoop12.domain.local:2181,hadoop13.domain.local:2181,hadoop14.domain.local:2181</value>
  </property>
  <property>
    <name>dfs.ha.namenodes.nameservice1</name>
    <value>namenode114,namenode137</value>
  </property>
  <property>
    <name>dfs.namenode.rpc-address.nameservice1.namenode114</name>
    <value>hadoop12.domain.local:8020</value>
  </property>
  <property>
    <name>dfs.namenode.servicerpc-address.nameservice1.namenode114</name>
    <value>hadoop12.domain.local:8022</value>
  </property>
  <property>
    <name>dfs.namenode.http-address.nameservice1.namenode114</name>
    <value>hadoop12.domain.local:50070</value>
  </property>
  <property>
    <name>dfs.namenode.https-address.nameservice1.namenode114</name>
    <value>hadoop12.domain.local:50470</value>
  </property>
  <property>
    <name>dfs.namenode.rpc-address.nameservice1.namenode137</name>
    <value>hadoop14.domain.local:8020</value>
  </property>
  <property>
    <name>dfs.namenode.servicerpc-address.nameservice1.namenode137</name>
    <value>hadoop14.domain.local:8022</value>
  </property>
  <property>
    <name>dfs.namenode.http-address.nameservice1.namenode137</name>
    <value>hadoop14.domain.local:50070</value>
  </property>
  <property>
    <name>dfs.namenode.https-address.nameservice1.namenode137</name>
    <value>hadoop14.domain.local:50470</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
  <property>
    <name>dfs.blocksize</name>
    <value>134217728</value>
  </property>
  <property>
    <name>dfs.client.use.datanode.hostname</name>
    <value>true</value>
  </property>
  <property>
    <name>fs.permissions.umask-mode</name>
    <value>022</value>
  </property>
  <property>
    <name>dfs.namenode.acls.enabled</name>
    <value>false</value>
  </property>
  <property>
    <name>dfs.client.read.shortcircuit</name>
    <value>false</value>
  </property>
  <property>
    <name>dfs.domain.socket.path</name>
    <value>/var/run/hdfs-sockets/dn</value>
  </property>
  <property>
    <name>dfs.client.read.shortcircuit.skip.checksum</name>
    <value>false</value>
  </property>
  <property>
    <name>dfs.client.domain.socket.data.traffic</name>
    <value>false</value>
  </property>
  <property>
    <name>dfs.datanode.hdfs-blocks-metadata.enabled</name>
    <value>true</value>
  </property>
</configuration>
3

3 Answers

14
votes

You can just change the configure of hdfs-site.xml of namenode Notice the dfs.namenode.datanode.registration.ip-hostname-check

6
votes

After a lot of issues with this setup, I finally figured what was wrong... Even though my config was right when i set up, it actually happens that resolvconf (the program) tends to reset the /etc/resolv.conf configuration file and overwrite my settings for search domain.local

It also happens that Cloudera and Hadoop use various ways to determine IP address, and unfortunately they are not consistant: Cloudera first looks up IP using SSH, which like PING and other programs use the GLIBC resolver, but later uses HOST, which by passes the GLIBC resolver and uses DNS directly, including the /etc/hosts file, and the /etc/resolv.conf file

So, at first it would work fine, but RESOLVCONF would automatically override my domain and search settings and mess things up.

I ended up REMOVING resolveconf from my setup, and with the proper files in place (hosts, resolv.conf) and making sure HOST resolves to the FQDN, it's all good. So the trick was to remove RESOLVCONF which is installed by default since Ubuntu 10.04 I believe. This is of course true on a local setup like mine. On an actual cluster setup on a network with DNS, just make sure the DNS resolves nodes properly.

0
votes

I had the same problem and after reading all the comments I realized that the problem is in reverse DNS Lookup. I fixed them and then everything just works!