
I am using Apache Hadoop 2.7.1 on a cluster that consists of three nodes:

nn1 (master name node)
nn2 (second name node)
dn1 (data node)

We know that if we configure high availability in this cluster, we will have two name nodes: one active and one standby.

And if we also configure the cluster to be addressed by a nameservice, the following scenario should work.

The scenario is:

1. nn1 is active and nn2 is standby.

If we want to get a file (called myfile) from dn1, we can send this URL from a browser (a WebHDFS request):

http://nn1/webhdfs/v1/hadoophome/myfile/?user.name=root&op=OPEN

2. The name node daemon on nn1 is killed, so through high availability nn1 becomes standby and nn2 becomes active. We can now get myfile by sending the same web request to nn2, since it is the active node:

http://nn2/webhdfs/v1/hadoophome/myfile/?user.name=root&op=OPEN

So configuring a nameservice with high availability is enough to handle name node failure, and WebHDFS keeps working.

What, then, is the benefit of adding HttpFs here, given that WebHDFS with high availability is not supported and we have to configure HttpFs?

Could you put a load balancer / reverse proxy in front and have it decide whether to send to nn1 or nn2? - Scovetta
You mean that when we rely on the nameservice we don't have to know the IP addresses of the two name nodes, and there is no WebHDFS command to test their status, so we depend on HttpFs, which is fixed on a specific host. But if we are handling WebHDFS from an external application, which is my case, we can redirect our request to the other name node whenever a request to one of them throws an exception on port 50070 (because that node is currently standby or down), since we know their IPs in advance. So HttpFs would have no benefit here, wouldn't it? - oula alshiekh
@Scovetta Yes, that is an option, but Hadoop provides HttpFs to get that done. - franklinsijo
Load-balancing the name nodes and using WebHDFS potentially performs better than using HttpFs, unless you don't want to expose all the data nodes outside your Hadoop cluster. Am I right? - Betta
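The client-side failover idea discussed in the comments can be sketched in code. This is a minimal Python sketch under the assumptions from the question: both Namenode hosts are known in advance, WebHDFS listens on the default Hadoop 2.x HTTP port 50070, and the helper names (`build_webhdfs_url`, `fetch_with_failover`) are hypothetical, not part of any Hadoop client library:

```python
# Hypothetical client-side failover between two known Namenode hosts
# when calling WebHDFS directly (i.e. without HttpFs).
from urllib.error import URLError
from urllib.request import urlopen

NAMENODES = ["nn1", "nn2"]   # both NN hosts must be known in advance
WEBHDFS_PORT = 50070         # default NN HTTP port in Hadoop 2.7.x

def build_webhdfs_url(host, path, user="root", op="OPEN"):
    """Build a WebHDFS REST URL for the given host and HDFS path."""
    return "http://{0}:{1}/webhdfs/v1{2}?user.name={3}&op={4}".format(
        host, WEBHDFS_PORT, path, user, op)

def fetch_with_failover(path, fetch=urlopen):
    """Try each Namenode in turn and return the first successful response.

    The `fetch` callable is injectable so the failover logic can be
    exercised without a live cluster.
    """
    last_error = None
    for host in NAMENODES:
        url = build_webhdfs_url(host, path)
        try:
            return fetch(url)
        except (URLError, OSError) as err:  # standby NN rejects / host down
            last_error = err
    raise last_error
```

This is essentially what HttpFs (or the HA-aware `hdfs` client) saves you from writing yourself: the caller no longer has to know both hosts or handle the retry.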

1 Answer


I understand that this is a follow-up to your previous question here.

WebHDFS and HttpFs are two different things. WebHDFS is part of the Namenode, and it is the NN itself that handles the WebHDFS API calls, whereas HttpFs is a separate service, independent of the Namenodes, and the HttpFs server handles the API calls.

what is the benefit of adding httpfs

Your REST API calls will remain the same irrespective of which NN is in Active state. HttpFs, being HA aware, will direct the request to the current Active NN.

Let us assume HttpFs server is started in nn1.

WebHDFS GET request

curl "http://nn1:50070/webhdfs/v1/hadoophome/myfile/?user.name=root&op=OPEN"

This is served by the Namenode daemon running on nn1.
Scenario 1: nn1 is Active. The request is answered with a valid response.
Scenario 2: nn2 is Active. The same request fails because there is no Active NN running on nn1.

So the REST call must be modified to target nn2:

curl "http://nn2:50070/webhdfs/v1/hadoophome/myfile/?user.name=root&op=OPEN"

Now, this is served by the NN daemon running on nn2.

HttpFs GET request

curl "http://nn1:14000/webhdfs/v1/hadoophome/myfile/?user.name=root&op=OPEN"

This request is served by the HttpFs service running on nn1.
Scenario 1: nn1 is Active. The HttpFs server on nn1 directs the request to the current Active Namenode, nn1.
Scenario 2: nn2 is Active. The HttpFs server on nn1 directs the request to the current Active Namenode, nn2.

In both scenarios, the REST call is the same. The request fails only if the HttpFs server itself is down.
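For completeness, a rough sketch of what enabling HttpFs involves in Hadoop 2.x. The user name `root` here mirrors the question and is an assumption; the exact property names should be checked against your distribution's documentation:

```xml
<!-- core-site.xml on the cluster: allow the user running the HttpFs
     server (assumed "root" here, matching the question) to proxy
     requests on behalf of end users. -->
<property>
  <name>hadoop.proxyuser.root.hosts</name>
  <value>*</value>
</property>
<property>
  <name>hadoop.proxyuser.root.groups</name>
  <value>*</value>
</property>
```

The server is then started with `sbin/httpfs.sh start` and listens on port 14000 by default, which is why the HttpFs requests above use that port.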

configuring name service with high availability is enough for name node failure and for webhdfs to work fine

A nameservice is the logical name given to the pair of Namenodes. This nameservice is not an actual host: it resolves only inside the HDFS client configuration, not through DNS, so it cannot be substituted for the host in the REST API calls.
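To make that concrete, here is a minimal sketch of the HA block in hdfs-site.xml, assuming the nameservice is named `mycluster` (the name is an arbitrary example). Note that `mycluster` appears only as a configuration key, never as a network address:

```xml
<!-- Hypothetical HA configuration: "mycluster" is the logical
     nameservice name. HDFS clients map it to nn1/nn2 via these
     properties; a browser cannot resolve it, so
     http://mycluster:50070/... would not work. -->
<property>
  <name>dfs.nameservices</name>
  <value>mycluster</value>
</property>
<property>
  <name>dfs.ha.namenodes.mycluster</name>
  <value>nn1,nn2</value>
</property>
<property>
  <name>dfs.namenode.http-address.mycluster.nn1</name>
  <value>nn1:50070</value>
</property>
<property>
  <name>dfs.namenode.http-address.mycluster.nn2</name>
  <value>nn2:50070</value>
</property>
```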