Recover Hadoop NameNode Failure

Question

Scenario 1:

The HDFS fsimage and edit log are written to multiple places, including an NFS mount.

A) NameNode daemon crash. Solution: just restart the NameNode process.

B) The host where the NameNode is running is down.

Solution:

Start the NameNode on a different host with an empty dfs.name.dir.
Point dfs.name.dir to the NFS mount that holds a copy of the metadata, OR
use the -importCheckpoint option when starting the NameNode, after pointing fs.checkpoint.dir to the checkpoint directory from the Secondary NameNode.
Change fs.default.name to the backup host's URI and restart the cluster with all the slave IPs in the slaves file.
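The -importCheckpoint variant of these steps could be sketched roughly as below. This is a dry run under stated assumptions: the host name, port, and directory paths are hypothetical placeholders, the script only writes example config fragments into a scratch directory, and it prints the start command rather than running it.

```shell
#!/bin/sh
# Dry-run sketch: writes example config fragments and prints the command an
# operator would run. It does not start or touch any Hadoop daemon.
set -eu

# All values below are hypothetical placeholders, not taken from the question.
CONF_DIR="${CONF_DIR:-./nn-recovery-conf}"
NEW_NN_URI="${NEW_NN_URI:-hdfs://backup-nn.example.com:8020}"
NAME_DIR="${NAME_DIR:-/data/dfs/name}"                       # must be empty on the new host
CHECKPOINT_DIR="${CHECKPOINT_DIR:-/data/dfs/namesecondary}"  # copied from the Secondary NameNode

mkdir -p "$CONF_DIR"

# dfs.name.dir: where the imported image will be saved.
cat > "$CONF_DIR/hdfs-site.xml" <<EOF
<configuration>
  <property>
    <name>dfs.name.dir</name>
    <value>$NAME_DIR</value>
  </property>
</configuration>
EOF

# fs.default.name: the new NameNode URI that clients and DataNodes use.
# fs.checkpoint.dir: the checkpoint that -importCheckpoint reads from.
cat > "$CONF_DIR/core-site.xml" <<EOF
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>$NEW_NN_URI</value>
  </property>
  <property>
    <name>fs.checkpoint.dir</name>
    <value>$CHECKPOINT_DIR</value>
  </property>
</configuration>
EOF

echo "Config fragments written to $CONF_DIR"
echo "On the new host, after merging them into the real config (not executed here):"
echo "  hadoop --config $CONF_DIR namenode -importCheckpoint"
```

The fragments are meant to be merged into the existing configuration by hand, not copied over it; the same fs.default.name value would also have to reach every slave before the cluster is restarted.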

Note - We may lose any edits made after the last checkpoint.

Scenario 2:

The HDFS fsimage is written into a single directory.

A) NameNode daemon crash. Solution: unknown.

B) The host where the NameNode is running is down.

Solution:

Create a blank directory and point dfs.name.dir to it.
Start the NameNode with -importCheckpoint after pointing fs.checkpoint.dir to the checkpoint directory from the Secondary NameNode.
Change fs.default.name to the backup host's URI and restart the cluster with all the slave IPs in the slaves file.

This way we would again lose any edits made after the last checkpoint.

Please let me know if this is how we can manually recover the cluster.

Can you edit your post to include a proper question please? I'm not sure what you're asking. It would also be helpful to post some log file snippets if you have them. It'll help me diagnose your issues. Thanks — Pradeep Gollakota
I am describing some production scenarios. What should be done in Scenario 1 and Scenario 2? I want to learn the different NameNode recovery techniques. — Jagaran
@Jagaran: in scenario 2, case A, would restarting the NameNode as in scenario 1 not solve the problem? — vishnu viswanath
@Jagaran can you please paste your stack trace too? I think it's the solution for me too. — Murtaza Kanchwala

Harsh J · Accepted Answer · 2012-11-15T19:13:43

In production, you should run the NameNodes in HA mode with a quorum of JournalNodes, or with shared HA-NFS storage for the edit log transaction files. If you do not want or cannot use HA, run the NN with at least two storage directories for both images and edit logs, preferably with one of them on a soft-mounted NFS mount point so that the name-system is automatically persisted off-machine.
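For the non-HA case, a minimal sketch of what "at least two storage directories" might look like is below. The two paths are hypothetical (one local disk, one NFS mount); the script only counts the comma-separated entries and prints the matching hdfs-site.xml property, it changes nothing on disk.

```shell
#!/bin/sh
set -eu

# Hypothetical paths: one local disk and one soft-mounted NFS export.
NAME_DIRS="${NAME_DIRS:-/data/1/dfs/name,/mnt/nfs/dfs/name}"

# Count the non-empty comma-separated entries.
# "|| true" keeps the script alive when grep finds nothing and exits non-zero.
dir_count=$(printf '%s\n' "$NAME_DIRS" | tr ',' '\n' | grep -c . || true)

# A single directory is a single point of failure for the metadata.
if [ "$dir_count" -lt 2 ]; then
  echo "warning: only $dir_count storage directory configured" >&2
fi

# Print the property to place in hdfs-site.xml; the NameNode writes its
# metadata to every directory in the list.
cat <<EOF
<property>
  <name>dfs.name.dir</name>
  <value>$NAME_DIRS</value>
</property>
EOF
```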

If you have just one storage directory and no HA configuration, then the best you can recover, if you lose all the files, is a past checkpoint. If you didn't lose the files, you can try the hadoop namenode -recover option, as illustrated by this post, to recover the image plus some (or all) of the edits.
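A cautious way to approach the -recover option might look like the sketch below, assuming your Hadoop release includes recovery mode. The storage path is a hypothetical placeholder, and the function only prints the plan instead of running anything. The backup step is there because recovery mode can discard edits it cannot read.

```shell
#!/bin/sh
set -eu

# Print (but do not run) a recovery plan for one NameNode storage directory.
recovery_plan() {
  name_dir="$1"
  backup_dir="${name_dir}.backup"   # hypothetical backup location
  echo "1. Stop the NameNode if it is still running."
  echo "2. cp -a $name_dir $backup_dir"
  echo "3. hadoop namenode -recover"
  echo "4. Start the NameNode normally and check its log."
}

recovery_plan "${NAME_DIR:-/data/dfs/name}"
```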
