I read the documentation for Amazon's S3DistCp, and it says:
"During a copy operation, S3DistCp stages a temporary copy of the output in HDFS on the cluster. There must be sufficient free space in HDFS to stage the data, otherwise the copy operation fails. In addition, if S3DistCp fails, it does not clean the temporary HDFS directory, therefore you must manually purge the temporary files. For example, if you copy 500 GB of data from HDFS to S3, S3DistCp copies the entire 500 GB into a temporary directory in HDFS, then uploads the data to Amazon S3 from the temporary directory".
This is not insignificant, especially if you have a large HDFS cluster with a lot of data to copy. Does anybody know whether the regular Hadoop DistCp has the same behaviour of staging the files to be copied in a temporary directory?