2
votes

I need to distcp only a specific set of files (a handful at a time), not a whole directory.

I couldn't find a way to do it.

  1. One idea is to copy the files into a temporary directory and then distcp that directory. Once the copy completes, I can delete the temp directory.

  2. Run an individual distcp command for each file. This could be painful.

I'm not sure whether a comma-separated list of sources is allowed.

Any ideas?

Thanks in advance.

If they have a pattern, you can make use of wildcards. Please show us a sample of the directory structure. - franklinsijo
Just application directories. Imagine Spark application history files: /var/log/spark/appHistory/<appId>/. I just need a handful at a time, so wildcards are not very helpful. - Neelesh Salian

1 Answer

4
votes

You can either pass all the files as sources to the DistCp command, with the destination as the last argument:

hadoop distcp hdfs://src_nn/var/log/spark/appHistory/<appId_1>/ \
              hdfs://src_nn/var/log/spark/appHistory/<appId_2>/ \
              ....
              hdfs://src_nn/var/log/spark/appHistory/<appId_n>/ \
              hdfs://dest_nn/target/

Or create a file containing the list of sources and pass it to the command as the source with the -f option:

hadoop distcp -f hdfs://src_nn/list_of_files hdfs://dest_nn/target/
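
As a rough sketch of the -f approach: the list file holds one fully qualified source URI per line. The app IDs below (app-001 and so on) are made-up placeholders, and the hdfs://src_nn/list_of_files location simply reuses the path from the command above, so adjust both for your cluster. The Hadoop steps only run if a hadoop client is on the PATH.

```shell
# Hypothetical app IDs; replace with the handful you actually want to copy.
APP_IDS="app-001 app-002 app-003"
SRC_BASE="hdfs://src_nn/var/log/spark/appHistory"
LIST_FILE="$(mktemp)"

# Write one fully qualified source URI per line.
for id in $APP_IDS; do
  printf '%s/%s/\n' "$SRC_BASE" "$id"
done > "$LIST_FILE"

# Upload the list to HDFS and copy everything in a single DistCp job.
# Skipped when no hadoop client is installed on this machine.
if command -v hadoop >/dev/null 2>&1; then
  hadoop fs -put -f "$LIST_FILE" hdfs://src_nn/list_of_files
  hadoop distcp -f hdfs://src_nn/list_of_files hdfs://dest_nn/target/
fi
```

This keeps it to one DistCp job no matter how many directories you pick, and it avoids the temporary-directory copy from the question.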