Raspberry Pi Hadoop Cluster Configuration

Question

I've recently been trying to build and configure an 8-Pi Raspberry Pi 3 Hadoop cluster (as a personal project over the summer). Please bear with me (unfortunately I am a little new to Hadoop). I am using Hadoop version 2.9.2. I think it's important to note that right now I am just trying to get one Namenode and one Datanode completely functional with one another, before moving ahead and replicating the same procedure on the remaining seven Pis.

The issue: my Namenode (alias: master) is the only node displayed as a 'Live Datanode', both in the dfs-health interface and in the output of:

hdfs dfsadmin -report

This happens even though the Datanode is displayed as an 'Active Node' (on the 'Nodes of the cluster' page of the Hadoop UI) and 'master' is not listed in the slaves file. The configuration I am aiming for is one where the Namenode does not perform any Datanode operations, and where the command above displays my Datanode (alias: slave-01) as a 'Live Datanode'.

I suspect that my issue is caused by the fact that both my Namenode and Datanode use the same hostname (raspberrypi), but I am unsure which configuration changes I need to make to correct it. After looking through the documentation, I unfortunately couldn't find a conclusive answer as to whether sharing a hostname is allowed or not.

If someone could please help me solve this issue it would be extremely appreciated! I have provided any relevant file-information below (which I thought may be useful for solving the issue). Thank you :)

PS: All files are identical within the Namenode and Datanode unless otherwise specified.

===========================================================================

Update 1

I have removed localhost from the slaves file on both the Namenode and Datanode, and changed their respective hostnames to 'master' and 'slave-01' as well.

After running jps, I have noticed that all of the correct processes are running on the master node. However, the Datanode is failing with an error, for which the log shows:

ExitCodeException exitCode=1: chmod: changing permissions of '/opt/hadoop_tmp/hdfs/datanode': Operation not permitted.

Unfortunately the issue persists even after changing the permissions with 'chmod 777'. If someone could please help me solve this issue it would be extremely appreciated! Thanks in advance :)

===========================================================================

Hosts File

127.0.0.1     localhost
::1           localhost ip6-localhost ip6-loopback
ff02::1       ip6-allnodes
ff02::2       ip6-allrouters

127.0.1.1     raspberrypi
192.168.1.2   master
192.168.1.3   slave-01

Master File

master

Slaves File

localhost
slave-01

Core-Site.xml

<configuration>
    <property>
        <name>fs.default.name</name>
        <value>hdfs://master:9000/</value>
    </property>
    <property>
        <name>fs.default.FS</name>
        <value>hdfs://master:9000/</value>
    </property>
</configuration>

HDFS-Site.xml

<configuration>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>/opt/hadoop_tmp/hdfs/datanode</value>
        <final>true</final>
    </property>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>/opt/hadoop_tmp/hdfs/namenode</value>
        <final>true</final>
    </property>
    <property>
        <name>dfs.namenode.http-address</name>
        <value>master:50070</value>
    </property>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
</configuration>

Mapred-Site.xml

<configuration>
    <property>
        <name>mapreduce.job.tracker</name>
        <value>master:5431</value>
    </property>
    <property>
        <name>mapred.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>

Yarn-Site.xml

<configuration>
    <property>
        <name>yarn.resourcemanager.resource-tracker.address</name>
        <value>master:8025</value>
    </property>
    <property>
        <name>yarn.resourcemanager.scheduler.address</name>
        <value>master:8035</value>
    </property>
    <property>
        <name>yarn.resourcemanager.address</name>
        <value>master:8050</value>
    </property>
</configuration>

Given that you have job.tracker, which doesn't exist in Hadoop 2, and the duplicated fs.default.FS entry (presumably meant to be fs.defaultFS, the key that takes precedence over the deprecated fs.default.name), it looks like you're guessing at config files... You do not need the YARN or MapReduce files at all just to get a namenode up and running. Do that first, then focus on the datanode, then on the ResourceManager and NodeManager, and finish off with submitting code - OneCricketeer
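
To make the comment above concrete, here is a minimal sketch of what core-site.xml might be reduced to for this setup. It assumes the Hadoop 2.x key name fs.defaultFS (fs.default.name is its deprecated alias, and fs.default.FS as typed in the question is, as far as I know, not a key Hadoop reads). The host and port are simply copied from the question.

```xml
<!-- core-site.xml: sketch only; host and port copied from the question -->
<configuration>
    <property>
        <!-- Hadoop 2.x key; supersedes the deprecated fs.default.name -->
        <name>fs.defaultFS</name>
        <value>hdfs://master:9000/</value>
    </property>
</configuration>
```

Once HDFS is working and you move on to YARN, the documented key in mapred-site.xml is mapreduce.framework.name (value yarn), not mapred.framework.name, and the job tracker entry can be dropped.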

OneCricketeer · Accepted Answer · 2018-12-05T02:49:49

You could let your local router serve up the hostnames rather than manipulating /etc/hosts yourself, but in order to change each Pi's name, edit /etc/hostname and reboot.

Before and after rebooting, check the output of hostname -f.

Note: "master" is really meaningless once you have a "YARN master", "HDFS master", "Hive Master", etc. It's best to literally name the machines namenode, data{1,2,3}, yarn-rm, and so on.

Regarding the permissions issue, you could run everything as root, but that's insecure outside a homelab. Instead, you'd want to run a few adduser commands to create at least hduser (the name used in other documentation, but it can be anything) and yarn. Then chown -R the data and log directories so they are owned by those users and the Unix groups they belong to, and run the Hadoop commands as those users.
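
One possible reason chmod 777 did not help, offered as an assumption rather than something confirmed in this thread: on startup the DataNode tries to set its data directory to the mode given by dfs.datanode.data.dir.perm (700 in the stock hdfs-default.xml), and that chmod fails with "Operation not permitted" when the user running the DataNode does not own the directory. The sketch below only spells out that default in hdfs-site.xml form.

```xml
<!-- hdfs-site.xml: shows the stock default for illustration, not a required change -->
<configuration>
    <property>
        <!-- Mode the DataNode applies to each dfs.datanode.data.dir entry at startup -->
        <name>dfs.datanode.data.dir.perm</name>
        <value>700</value>
    </property>
</configuration>
```

If that is what is happening here, loosening the mode by hand will not stick; fixing ownership of /opt/hadoop_tmp/hdfs/datanode with chown, as described above, is the change that matters.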