I'm trying to copy 193 GB data from s3 to HDFS. I'm running the following commands for s3-dist-cp and hadoop distcp:
s3-dist-cp --src s3a://PathToFile/file1 --dest hdfs:///user/hadoop/S3CopiedFiles/
hadoop distcp s3a://PathToFile/file1 hdfs:///user/hadoop/S3CopiedFiles/
I'm running these on the master node and also keeping a check on the amount being transferred. It took about an hour and after copying it over, everything gets erased, disk space is shown as 99.8% in the 4 core instances in my cluster, and the hadoop job runs forever. As soon as i run the command,
16/07/18 18:43:55 INFO mapreduce.Job: map 0% reduce 0%
16/07/18 18:44:02 INFO mapreduce.Job: map 100% reduce 0%
16/07/18 18:44:08 INFO mapreduce.Job: map 100% reduce 14%
16/07/18 18:44:11 INFO mapreduce.Job: map 100% reduce 29%
16/07/18 18:44:13 INFO mapreduce.Job: map 100% reduce 86%
16/07/18 18:44:18 INFO mapreduce.Job: map 100% reduce 100%
This gets printed immediately and then copies over data for an hour. It starts all over again.
16/07/18 19:52:45 INFO mapreduce.Job: map 0% reduce 0%
16/07/18 18:52:53 INFO mapreduce.Job: map 100% reduce 0%
Am i missing anything here? Any help is appreciated.
Also I would like to know where can i find the log files on the master node to see if the job is failing and hence looping? Thanks