
During failover, Hadoop's ZKFC takes care of switching between the active NameNode (ANN) and the standby NameNode (SNN). Part of this process is a step called fencing, which makes sure the previous ANN is shut down.

Suppose the ANN loses power and the fencing strategy in use is sshfence. Then:

"The switchover will not happen because SSH into the ANN will not work, which compromises high availability."

From the documentation:

"However, when a failover occurs, it is still possible that the previous Active NameNode could serve read requests to clients, which may be out of date until that NameNode shuts down when trying to write to the JournalNodes. For this reason, it is still desirable to configure some fencing methods even when using the Quorum Journal Manager."

  • How do other distributed systems solve this problem without compromising high availability?
  • If a solution to the question above already exists, why is HDFS not adopting it?

1 Answer


The HDFS configuration allows more than one fencing method.

From the docs:

"In order to do this, you must configure at least one fencing method. These are configured as a carriage-return-separated list, which will be attempted in order until one indicates that fencing has succeeded."

The important thing to note is that these methods should either implement some kind of timeout or return immediately. The easiest approach is to use sshfence with a timeout, followed by shell(/bin/true) as the second method. This fallback always reports success, so it assumes the ANN really is down whenever SSH fails. Example:

<property>
  <name>dfs.ha.fencing.methods</name>
  <value>sshfence
         shell(/bin/true)</value>
</property>
<property>
  <name>dfs.ha.fencing.ssh.connect-timeout</name>
  <value>30000</value>
</property>

Of course, you can write a more sophisticated script that checks whether the ANN is actually down before reporting success.