
I have spawned an EMR cluster with an EMR step to copy a file from S3 to HDFS (and vice versa) using s3-dist-cp. The cluster is an on-demand cluster, so we are not keeping track of its IP address.

The first EMR step is: hadoop fs -mkdir /input - This step completed successfully.

The second EMR step is the following command:

s3-dist-cp --s3Endpoint=s3.amazonaws.com --src=s3://<bucket-name>/<folder-name>/sample.txt --dest=hdfs:///input - This step FAILED
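For context, the step is submitted to the running cluster through the AWS CLI with command-runner.jar, roughly like the sketch below (the cluster ID is a placeholder, and the bucket/folder names are the same placeholders as above):

aws emr add-steps --cluster-id j-XXXXXXXXXXXXX \
  --steps 'Type=CUSTOM_JAR,Name=S3DistCpStep,ActionOnFailure=CONTINUE,Jar=command-runner.jar,Args=[s3-dist-cp,--s3Endpoint=s3.amazonaws.com,--src=s3://<bucket-name>/<folder-name>/sample.txt,--dest=hdfs:///input]'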

I get the following error:

Error: java.lang.IllegalArgumentException: java.net.UnknownHostException: sample.txt
    at org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:378)
    at org.apache.hadoop.hdfs.NameNodeProxies.createNonHAProxy(NameNodeProxies.java:310)
    at org.apache.hadoop.hdfs.NameNodeProxies.createProxy(NameNodeProxies.java:176)
    at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:678)
    at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:619)
    at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:149)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2717)
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:93)
    at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2751)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2733)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:377)
    at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
    at com.amazon.elasticmapreduce.s3distcp.CopyFilesReducer.reduce(CopyFilesReducer.java:213)
    at com.amazon.elasticmapreduce.s3distcp.CopyFilesReducer.reduce(CopyFilesReducer.java:28)
    at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:171)
    at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:635)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:390)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Caused by: java.net.UnknownHostException: sample.txt

But this file does exist on S3, and I can read it through my Spark application on EMR.
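For what it is worth, the object can also be listed directly with the AWS CLI (same placeholder bucket and folder names as above), so the key itself is not the problem:

aws s3 ls s3://<bucket-name>/<folder-name>/sample.txt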

Why not just use EMRFS and use the direct HDFS/S3 mount point? Often an error like this may be caused by the region. – Henry
Can you try running the command without the s3Endpoint argument? The transfer always goes through the endpoint if one is specified in the account. If that also doesn't work, try using hadoop distcp once. – Chirag
@Chirag: We tried that and it didn't work. – sashmi

1 Answer


The solution was that, when using s3-dist-cp, the filename should not be included in either the source or the destination; both should point to directories.

If you want to filter files in the src directory, you can use the --srcPattern option.

e.g.: s3-dist-cp --s3Endpoint=s3.amazonaws.com --src=s3://<bucket-name>/<folder-name>/ --dest=hdfs:///input/ --srcPattern=sample.txt.*
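If the pattern above does not match anything, note that --srcPattern is a regular expression that, as far as I know, must match the full source path, so a leading .* may be needed. You can then confirm the copy with hadoop fs (bucket and folder names are placeholders):

s3-dist-cp --s3Endpoint=s3.amazonaws.com --src=s3://<bucket-name>/<folder-name>/ --dest=hdfs:///input/ --srcPattern='.*sample.txt'
hadoop fs -ls hdfs:///input/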