
I read the documentation for Amazon's S3DistCp, and it says:

"During a copy operation, S3DistCp stages a temporary copy of the output in HDFS on the cluster. There must be sufficient free space in HDFS to stage the data, otherwise the copy operation fails. In addition, if S3DistCp fails, it does not clean the temporary HDFS directory, therefore you must manually purge the temporary files. For example, if you copy 500 GB of data from HDFS to S3, S3DistCp copies the entire 500 GB into a temporary directory in HDFS, then uploads the data to Amazon S3 from the temporary directory".

This is not insignificant, especially if you have a large HDFS cluster. Does anybody know whether the regular Hadoop DistCp has this same behaviour of staging the files to copy in a temporary folder?


1 Answer


DistCp does not use a temporary staging folder; it runs a MapReduce job that copies the files directly, whether the copy is intra-cluster, inter-cluster, or from HDFS to S3. AFAIK, DistCp also does not fail the whole batch of files if the copy fails for some reason.
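For reference, a plain HDFS-to-S3 copy with DistCp looks something like the command below. The bucket name and paths are just placeholders, and this assumes the cluster already has S3A credentials configured:

    hadoop distcp hdfs:///user/hadoop/data s3a://my-bucket/data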

If a total of 500 GB needs to be copied and DistCp fails after 200 GB has already been transferred, that 200 GB of data is already in S3. When you rerun the DistCp job, it will skip the files that already exist at the destination.
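To make the rerun explicitly skip what is already there, you can pass the -update option, which copies only files that are missing or differ at the destination. Again, the paths below are just placeholders:

    hadoop distcp -update hdfs:///user/hadoop/data s3a://my-bucket/data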

For more information about the commands, look at the DistCp guide here.