
I am trying to access a firewalled Hadoop cluster running YARN via a SOCKS proxy. The cluster itself does not use proxied connections; only my client, running on a local machine (e.g. a laptop), is connected via ssh -D 9999 user@gateway-host to a machine that can see the Hadoop cluster.
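
Concretely, the tunnel is an ordinary SSH dynamic port forward (adding -N is optional and just suppresses the remote shell):

ssh -N -D 9999 user@gateway-host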

In the Hadoop configuration core-site.xml (on my laptop) I have the following lines:

<property>
    <name>hadoop.socks.server</name>
    <value>localhost:9999</value>
</property>
<property>
    <name>hadoop.rpc.socket.factory.class.default</name>
    <value>org.apache.hadoop.net.SocksSocketFactory</value>
</property>

Accessing HDFS this way works great. However, when I try to submit a YARN job, it fails, and I can see in the logs that the nodes are not able to talk to each other:

java.io.IOException: Failed on local exception: java.net.SocketException: Connection refused; Host Details : local host is: "host1"; destination host is: "host2":8030; 
at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772)

where host1 and host2 are both part of the Hadoop cluster.

My guess is that the Hadoop nodes are trying to communicate via a SOCKS proxy as well, which obviously fails since no proxy server is running on those hosts. Is there a way to fix this apart from setting up a dedicated proxy server?

1 Answer

You are right: the Hadoop nodes must not use the SOCKS proxy for their own communication. The reason they do is that the job configuration, including hadoop.rpc.socket.factory.class.default, is shipped to the cluster when you submit the job, so the ApplicationMaster there tries to reach the ResourceManager (port 8030 is the scheduler address) through a proxy that does not exist on that host. You can prevent this by marking the SocketFactory setting final on the cluster side, so job configurations cannot override it.

In core-site.xml on the cluster, add the final tag to the default SocketFactory property:

    <property>
        <name>hadoop.rpc.socket.factory.class.default</name>
        <value>org.apache.hadoop.net.StandardSocketFactory</value>
        <final>true</final>
    </property>
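
As a quick sanity check (assuming the hdfs CLI is available on a cluster node), you can print the value the node actually resolves:

    # should print org.apache.hadoop.net.StandardSocketFactory on cluster nodes
    hdfs getconf -confKey hadoop.rpc.socket.factory.class.default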

Obviously, you must restart the cluster services for the change to take effect.
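
How you restart depends on how the cluster is managed; with the stock scripts it would be something along these lines (paths and scripts may differ in your distribution):

    # on the cluster, using the bundled sbin scripts
    $HADOOP_HOME/sbin/stop-yarn.sh && $HADOOP_HOME/sbin/start-yarn.sh
    $HADOOP_HOME/sbin/stop-dfs.sh && $HADOOP_HOME/sbin/start-dfs.sh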