3
votes

We have been running a 3-node AWS EMR cluster (1 NameNode, 2 DataNodes). We observed that NameNode checkpointing is not happening and the fsimage and .md5 files are not being updated. Edit logs keep piling up, eventually causing the NameNode to fail due to insufficient disk space.

HDFS version: Hadoop 2.8.3-amzn-0

-rw-r--r-- 1 hdfs hdfs        62 Sep  3 12:04 fsimage_0000000000000000000.md5
-rw-r--r-- 1 hdfs hdfs       317 Sep  3 12:04 fsimage_0000000000000000000
-rw-r--r-- 1 hdfs hdfs 260954697 Sep  3 13:49 edits_0000000000000000001-0000000000002061850
-rw-r--r-- 1 hdfs hdfs 270456683 Sep  3 14:54 edits_0000000000002061851-0000000000004196518
-rw-r--r-- 1 hdfs hdfs 256666626 Sep  3 15:54 edits_0000000000004196519-0000000000006223083
-rw-r--r-- 1 hdfs hdfs 256756282 Sep  3 16:54 edits_0000000000006223084-0000000000008250289
-rw-r--r-- 1 hdfs hdfs 263465424 Sep  3 17:59 edits_0000000000008250290-0000000000010330235
-rw-r--r-- 1 hdfs hdfs 257754598 Sep  3 19:49 edits_0000000000010330236-0000000000012365196
-rw-r--r-- 1 hdfs hdfs 257361703 Sep  3 21:39 edits_0000000000012365197-0000000000014396984
-rw-r--r-- 1 hdfs hdfs 258246258 Sep  3 23:29 edits_0000000000014396985-0000000000016435653
-rw-r--r-- 1 hdfs hdfs 257862137 Sep  4 01:19 edits_0000000000016435654-0000000000018471306
-rw-r--r-- 1 hdfs hdfs 257044520 Sep  4 03:09 edits_0000000000018471307-0000000000020496923
-rw-r--r-- 1 hdfs hdfs 256987603 Sep  4 04:59 edits_0000000000020496924-0000000000022520948
-rw-r--r-- 1 hdfs hdfs 254213703 Sep  4 06:44 edits_0000000000022520949-0000000000024522780
-rw-r--r-- 1 hdfs hdfs 265518336 Sep  4 08:34 edits_0000000000024522781-0000000000026613243

As per the Hadoop 2.8.3 documentation:

The Secondary NameNode or CheckpointNode will create a checkpoint of the namespace every 'dfs.namenode.checkpoint.txns' transactions (default: 1000000), regardless of whether 'dfs.namenode.checkpoint.period' (default: 3600 seconds) has expired.

But checkpointing is not happening on the NameNode.
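
The effective values of these settings can be checked on the NameNode host with hdfs getconf (a quick sketch; the property names are the standard Hadoop ones and are assumed to be unchanged on EMR):

    hdfs getconf -confKey dfs.namenode.checkpoint.txns          # default 1000000
    hdfs getconf -confKey dfs.namenode.checkpoint.period        # default 3600 (seconds)
    hdfs getconf -confKey dfs.namenode.checkpoint.check.period  # how often the checkpointer polls, default 60 (seconds)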

2
This is only a comment. I have not verified this on EMR, but I believe you can restart the NameNode to create a new checkpoint as a temporary solution: aws.amazon.com/premiumsupport/knowledge-center/… - John Hanley
I have faced a similar problem and have not found a solution. Eventually my NameNode runs out of memory and kills the streaming job. As the answer below states, you can force the issue; however, entering safemode causes the Spark checkpoints not to be written and the job to fail, so that is not an option. I wonder if safemode is required to save the image but can never be entered because of Spark. - Matthew Jackson
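
For reference, the restart suggested in the first comment would look roughly like the sketch below on an EMR release of this vintage (Hadoop 2.8.x), where the HDFS daemons are managed by upstart; the service name is an assumption, and newer systemd-based EMR releases use systemctl instead:

    # run on the master node; stops and restarts the NameNode daemon
    sudo stop hadoop-hdfs-namenode
    sudo start hadoop-hdfs-namenode
    # on newer (systemd-based) EMR releases this would be roughly:
    # sudo systemctl restart hadoop-hdfs-namenode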

2 Answers

1
votes

As a workaround, you can run the following commands to get the NameNode working again:

   hdfs dfsadmin -safemode enter  
   hdfs dfsadmin -saveNamespace  
   hdfs dfsadmin -safemode leave

https://community.hortonworks.com/content/supportkb/49438/how-to-manually-checkpoint.html
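
To confirm that the manual checkpoint actually took effect, the NameNode metadata directory can be listed afterwards; the path below is a placeholder (use whatever dfs.namenode.name.dir resolves to on your cluster), and a fresh fsimage_<txid> plus matching .md5 file with a current timestamp should appear:

    # find where the NameNode keeps its metadata (the value is a file:// URI)
    hdfs getconf -confKey dfs.namenode.name.dir
    # list that directory's "current" subdirectory; a new fsimage_<txid> and
    # fsimage_<txid>.md5 with a recent timestamp indicate the checkpoint worked
    ls -lt /path/from/above/current | head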

0
votes

Checkpoints are created by either a Secondary NameNode or a Checkpoint node.

The setup here has only the NameNode, which will not create checkpoints on its own.

A Checkpoint node or Secondary NameNode should be present in the setup for checkpoints to happen automatically; otherwise, a manual safemode + saveNamespace or a NameNode restart is required for a checkpoint to occur.
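
A quick way to check whether a Secondary NameNode is actually configured and running (a sketch; run on the master node, and jps ships with the JDK):

    # is a secondary NameNode HTTP address configured at all?
    hdfs getconf -confKey dfs.namenode.secondary.http-address
    # is a SecondaryNameNode (or CheckpointNode) JVM actually running?
    jps | grep -E 'SecondaryNameNode|CheckpointNode'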