I am using the following Python code to upload a file from my local system to a remote HDFS using pyhdfs:

from pyhdfs import HdfsClient
client = HdfsClient(hosts='1.1.1.1',user_name='root')
client.mkdirs('/jarvis')
client.copy_from_local('/my/local/file', '/hdfs/path')

I am using Python 3.5. Hadoop is running on its default port 50070, and 1.1.1.1 is the address of my remote Hadoop host.

Creating the directory "jarvis" works fine, but copying the file does not. I get the following error:

Traceback (most recent call last):
  File "test_hdfs_upload.py", line 14, in <module>
    client.copy_from_local('/tmp/data.json','/test.json')
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pyhdfs.py", line 753, in copy_from_local
    self.create(dest, f, **kwargs)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pyhdfs.py", line 426, in create
    metadata_response.headers['location'], data=data, **self._requests_kwargs)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/requests/api.py", line 99, in put
    return request('put', url, data=data, **kwargs)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/requests/api.py", line 44, in request
    return session.request(method=method, url=url, **kwargs)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/requests/sessions.py", line 383, in request
    resp = self.send(prep, **send_kwargs)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/requests/sessions.py", line 486, in send
    r = adapter.send(request, **kwargs)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/requests/adapters.py", line 378, in send
    raise ConnectionError(e)
requests.exceptions.ConnectionError: HTTPConnectionPool(host='ip-1-1-1-1', port=50075): Max retries exceeded with url: /webhdfs/v1/test.json?op=CREATE&user.name=root&namenoderpcaddress=ip-1-1-1-1:9000&overwrite=false (Caused by : [Errno 8] nodename nor servname provided, or not known)


1 Answer

First, check whether WebHDFS is enabled for your HDFS cluster. The pyhdfs library talks to HDFS through WebHDFS, so WebHDFS must be enabled in the HDFS configuration. To enable it, set dfs.webhdfs.enabled to true in hdfs-site.xml, for example:

<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>file:/path/to/namenode/dir/</value>
    </property>
    <property>
        <name>dfs.checkpoint.dir</name>
        <value>file:/path/to/checkpoint/dir/</value>
    </property>
    <property>
        <name>dfs.checkpoints.edits.dir</name>
        <value>file:/path/to/checkpoints-ed/dir/</value>
    </property>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>file:/path/to/datanode/dir/</value>
    </property>
    <property>
        <name>dfs.webhdfs.enabled</name>
        <value>true</value>
    </property>
</configuration>
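
After updating hdfs-site.xml, restart HDFS so the change takes effect. You can then sanity-check that WebHDFS is reachable with a plain REST call. This is a minimal sketch, assuming 1.1.1.1 is your NameNode, the default HTTP port 50070 and user root, as in the question:

import requests

# Assumption: NameNode at 1.1.1.1, WebHDFS on the default port 50070,
# HDFS user root. Adjust these to match your cluster.
resp = requests.get(
    'http://1.1.1.1:50070/webhdfs/v1/?op=LISTSTATUS&user.name=root',
    timeout=10,
)
print(resp.status_code)  # 200 means WebHDFS is enabled and reachable
print(resp.json())       # listing of the HDFS root directory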

Also, when copy_from_local() is called, the WebHDFS CREATE request first goes to the NameNode, which redirects the client to one of the DataNodes (note port 50075, the DataNode HTTP port, in your traceback). The redirect URL usually contains that DataNode's hostname rather than its IP address, and the client then tries to open an HTTP connection to it. This is where the call fails: the hostname (ip-1-1-1-1 in your case) cannot be resolved from your local machine.
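
You can see the redirect yourself by issuing the CREATE call without following redirects; the Location header will contain the DataNode hostname that your machine cannot resolve. A minimal sketch, assuming the same NameNode address, port and user name as in the question:

import requests

# Assumption: 1.1.1.1 is the NameNode, WebHDFS listens on 50070 and the
# HDFS user is root, as in the question.
resp = requests.put(
    'http://1.1.1.1:50070/webhdfs/v1/test.json?op=CREATE&user.name=root',
    allow_redirects=False,
    timeout=10,
)
print(resp.status_code)              # expect 307 Temporary Redirect
print(resp.headers.get('Location'))  # e.g. http://ip-1-1-1-1:50075/webhdfs/v1/...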

To make those hostnames resolvable, add the appropriate mappings to your /etc/hosts file.

For instance, if you have an HDFS cluster with one NameNode and two DataNodes, with the following IP addresses and hostnames:

  • 192.168.0.1 (NameNode1)
  • 192.168.0.2 (DataNode1)
  • 192.168.0.3 (DataNode2)

you will need to update your /etc/hosts file as follows:

127.0.0.1     localhost
::1           localhost
192.168.0.1   NameNode1
192.168.0.2   DataNode1
192.168.0.3   DataNode2

This enables hostname resolution from your host to your HDFS cluster, and the WebHDFS API calls made through PyHDFS will then reach the DataNodes.
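
With the /etc/hosts entries in place, the original upload should go through. As a quick end-to-end check, here is a sketch using the hypothetical hostnames and addresses from the example above (substitute your own; overwrite=True is simply passed through to the WebHDFS CREATE call):

import socket
from pyhdfs import HdfsClient

# Assumption: hostnames and IPs from the example /etc/hosts above.
for host in ('NameNode1', 'DataNode1', 'DataNode2'):
    print(host, socket.gethostbyname(host))   # should print the mapped IPs

client = HdfsClient(hosts='192.168.0.1:50070', user_name='root')
client.copy_from_local('/tmp/data.json', '/test.json', overwrite=True)
print(client.list_status('/'))                # confirm the file landed in HDFS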