0 votes

Is there a way to copy a list of files from S3 to HDFS with s3distcp, instead of a complete folder? This is for cases where srcPattern cannot work.

I have multiple files in an S3 folder, all with different names. I want to copy only specific files to an HDFS directory. I did not find any way to specify multiple source file paths to s3distcp.

The workaround I am currently using is to list all the file names in srcPattern:

hadoop jar s3distcp.jar \
    --src s3n://bucket/src_folder/ \
    --dest hdfs:///test/output/ \
    --srcPattern '.*somefile.*|.*anotherone.*'

Can this approach still work when the number of files is very large, say around 10,000?


2 Answers

4 votes

hadoop distcp should solve your problem. You can use distcp to copy data from S3 to HDFS.

It also supports wildcards, and you can provide multiple source paths in a single command.

http://hadoop.apache.org/docs/r1.2.1/distcp.html

Go through the Usage section at that URL.

Example: suppose you have the following files in the S3 bucket test-bucket, inside the test1 folder:

abc.txt
abd.txt
defg.txt

And inside the test2 folder you have:

hijk.txt
hjikl.txt
xyz.txt

And your HDFS path is hdfs://localhost.localdomain:9000/user/test/.

Then the distcp command for a particular pattern is as follows:

hadoop distcp s3n://test-bucket/test1/ab*.txt \
    s3n://test-bucket/test2/hi*.txt \
    hdfs://localhost.localdomain:9000/user/test/
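
If your files do not share a common wildcard pattern, distcp also has a -f option (described on the same usage page) that reads the list of source URIs from a file. A minimal sketch, assuming a hypothetical filelist.txt stored on HDFS with one source URI per line:

# filelist.txt contains one source URI per line, for example:
#   s3n://test-bucket/test1/abc.txt
#   s3n://test-bucket/test2/xyz.txt
hadoop fs -put filelist.txt hdfs://localhost.localdomain:9000/user/test/filelist.txt
hadoop distcp -f hdfs://localhost.localdomain:9000/user/test/filelist.txt \
    hdfs://localhost.localdomain:9000/user/test/

This way the list can contain thousands of entries without building a huge regex.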
3 votes

Yes, you can. Create a manifest file listing all the files you need and use the --copyFromManifest option, as mentioned here.
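
A minimal sketch of what that could look like, reusing the jar and paths from the question; the manifest location is hypothetical, and the manifest itself is normally a gzipped file produced by an earlier s3distcp run via --outputManifest:

hadoop jar s3distcp.jar \
    --src s3n://bucket/src_folder/ \
    --dest hdfs:///test/output/ \
    --previousManifest s3://bucket/manifests/files-to-copy.gz \
    --copyFromManifest

With --copyFromManifest, s3distcp should copy only the files listed in the manifest rather than everything under --src.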