2
votes

I've a Hadoop Cluster at work that has over 50 nodes, We occasionally face disk failures and require to decommission the datanode roles.

My Question is - if I were to only decommission the datanode and leave the tasktracker running, would this result in failed tasks/jobs on this node due to unavailability of HDFS Service on that node?

3

3 Answers

0
votes
  1. Does the TaskTracker on Node1 sit idle since there is no DataNode service on that Node? Correct, if the data node is disabled then the task tracker will not be able to process the data as the data will not be avaiable; it will be idle. 2. or Does the TaskTracker work on data from DataNodes on other Nodes? Nope, due to data locality principle, the task tracker will not process the data from other nodes.. 3. Do we get errors from TaskTracker Service on Node1 due to the DN on it's node being down? , Task tracker will not be able to process any data, so no errors.; 4. if I have services like Hive, Impala, etc running on HDFS - would those services throw error upon contact with TaskTracker on Node1? They will not be able to contact the task tracker on node 1. When client requests for the processing of the data, Name node tells the client about the data locations, so based on the data locations all other applications will communicate with data nodes
0
votes

I would expect any task that tries to read from HDFS on the "dead" node to fail. This should result in the node being blacklisted by M/R after N failures (default is 3 I think). Also, I believe this happens each time a job runs.

However, jobs should still finish since the tasks that got routed to the bad node will simply be retried on other nodes.

0
votes

Firstly, in order to run a job you need to have the input file. So when you load the input file to HDFS this will be split into 64 MB block size by default. Also there will be 3 replications with default settings. Now since one of your data node in the cluster is failed, Name node will not store the data in that node. Even if it tries to store also, it gets the frequent updates from data node about the status. So it will not choose that specific data node to store the data.

It should throw exception when you don't have the disk space and the only dead data node is left in the cluster. Then its time for you to replace the data node and scale up the cluster.

Hope this helps.