I am running Hadoop (hadoop-2.0.0-mr1-cdh4.1.2) on a cluster of 40 machines. Each machine has 12 disks used by Hadoop. The disks on one machine had become unbalanced, so I decided to rebalance them manually, as described in this post: "rebalance individual datanode in hadoop". I stopped the DataNode on that server and moved block file pairs, as well as whole subdirectories, between some of the disks.
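To make "moving a block file pair" concrete, here is a minimal Python sketch of the kind of move I mean. It is only an illustration, not the exact commands I ran. `move_block_pair` is a hypothetical helper, and it assumes the usual layout where a block file `blk_<id>` sits next to its checksum file `blk_<id>_<genstamp>.meta`.

```python
import glob
import os
import shutil


def move_block_pair(block_path, dest_dir):
    """Move a block file and its .meta companion into dest_dir.

    Assumes block_path looks like .../blk_<id> and that exactly one
    checksum file named blk_<id>_<genstamp>.meta sits next to it.
    Returns the new paths as [block, meta].
    """
    src_dir, block_name = os.path.split(block_path)
    pattern = os.path.join(glob.escape(src_dir), block_name + "_*.meta")
    meta_files = glob.glob(pattern)
    if len(meta_files) != 1:
        raise RuntimeError(
            "expected exactly one .meta file for %s, found %d"
            % (block_path, len(meta_files))
        )

    os.makedirs(dest_dir, exist_ok=True)
    sources = [block_path, meta_files[0]]
    targets = [os.path.join(dest_dir, os.path.basename(s)) for s in sources]

    # Check both targets before moving anything, so the pair is never split
    # just because one of the names already exists at the destination.
    for target in targets:
        if os.path.exists(target):
            raise RuntimeError("refusing to overwrite %s" % target)

    for source, target in zip(sources, targets):
        shutil.move(source, target)
    return targets
```

One caveat with a sketch like this: when the source and destination are on different disks, `shutil.move` copies the file, so the new copy is owned by whichever user ran the script, not necessarily by the user the DataNode runs as.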
As soon as I stopped the DataNode, the NameNode complained about missing blocks by displaying the following message in the UI: WARNING : There are 2002 missing blocks. Please check the logs or run fsck in order to identify the missing blocks.
Then I tried to restart the DataNode. It fails to start and keeps logging errors and warnings such as the following:
java.io.IOException: Invalid directory or I/O error occurred for dir: /data/disk3/dfs/data/current/BP-208475052-10.165.18.36-1351280731538/current/finalized/subdir61/subdir28
2013-12-20 01:40:29,046 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: IOException in offerService java.io.IOException: block pool BP-208475052-10.165.18.36-1351280731538 is not found
2013-12-20 01:40:29,088 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: Exception in BPOfferService for Block pool BP-208475052-10.165.18.36-1351280731538 (storage id DS-737580588-10.165.18.36-50010-1351280778276) service to aspen8hdp19.turner.com/10.165.18.56:54310 java.lang.NullPointerException
2013-12-20 01:40:34,088 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: IOException in offerService java.io.IOException: block pool BP-208475052-10.165.18.36-1351280731538 is not found
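One thing I can check on my side: as far as I understand, the "Invalid directory or I/O error" message shows up when the DataNode process cannot list a directory, so the ownership or permissions of the subdirectories I moved are a possible suspect. Below is a minimal sketch of such a check. `find_suspect_dirs` is a hypothetical helper of my own, not part of Hadoop, and the idea that the DataNode runs as a dedicated user (for example `hdfs`) is an assumption about the setup.

```python
import os
import stat


def find_suspect_dirs(root, expected_uid):
    """List directories under root (root included) that the expected owner
    probably cannot list: wrong owner, or owner lacks read/execute bits."""
    owner_rx = stat.S_IRUSR | stat.S_IXUSR

    def is_suspect(path):
        try:
            st = os.stat(path)
        except OSError:
            # If we cannot even stat it, treat it as a problem directory.
            return True
        return st.st_uid != expected_uid or (st.st_mode & owner_rx) != owner_rx

    suspects = [root] if is_suspect(root) else []
    # os.walk silently skips directories it cannot open, so each directory
    # is examined from its parent's listing rather than from its own visit.
    for dirpath, dirnames, _filenames in os.walk(root):
        for name in dirnames:
            path = os.path.join(dirpath, name)
            if is_suspect(path):
                suspects.append(path)
    return suspects
```

It could be called as `find_suspect_dirs("/data/disk3/dfs/data/current", pwd.getpwnam("hdfs").pw_uid)`, where the `hdfs` user name is again an assumption. An empty result would suggest the problem lies elsewhere.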
So, I have some questions:
- Isn't it enough to follow the approach I described, i.e. stop the DataNode, move block file pairs and/or subdirectories, then restart the DataNode?
- Do I need to restart NameNode or other services?
- Why does it complain about missing blocks or corrupt files?
- How can I restart the DataNode without those exceptions, so that the DN communicates successfully with the NN again?
I appreciate your help. Eduardo.