Hadoop Distcp - small files issue while copying between different locations

Question

I ran two distcp jobs, one copying 400+ GB and another copying 35.6 GB, and each took nearly 2-3 hours to complete.

We do have enough resources in the cluster.

But when I examined the container logs, I found that copying small files takes a very long time. The file in the log excerpt below is one such small file.

2019-10-23 14:49:09,546 INFO [main] org.apache.hadoop.tools.mapred.CopyMapper: Copying hdfs://service-namemode-prod-ab/abc/xyz/ava/abc/hello/GRP_part-00001-.snappy.parquet to s3a://bucket-name/Data/abc/xyz/ava/abc/hello/GRP_part-00001-.snappy.parquet
2019-10-23 14:49:09,940 INFO [main] org.apache.hadoop.tools.mapred.RetriableFileCopyCommand: Creating temp file: s3a://bucket-name/Data/.distcp.tmp.attempt_1571566366604_9887_m_000010_0
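To confirm that small files dominate, the average file size under the source path can be estimated from `hdfs dfs -count`, which prints DIR_COUNT, FILE_COUNT, CONTENT_SIZE and PATHNAME. The following is only a sketch: the helper name `avg_file_size` is hypothetical and the sample line is made-up data, not output from this cluster.

```shell
# Hypothetical helper: reads one line in `hdfs dfs -count` format
# (DIR_COUNT FILE_COUNT CONTENT_SIZE PATHNAME) and prints the
# average file size in bytes, truncated to an integer.
avg_file_size() {
  awk '$2 > 0 { printf "%d\n", $3 / $2 }'
}

# Made-up sample line: 12 directories, 4000 files, 400 MiB in total.
echo "12 4000 419430400 /abc/xyz/ava" | avg_file_size   # prints 104857

# On a real cluster (not run here):
#   hdfs dfs -count /abc/xyz/ava | avg_file_size
```

An average far below the HDFS block size would suggest that per-file overhead, rather than raw bandwidth, is what limits the copy.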

So what can be done to improve this with distcp to make the copy quicker?

Note: a similar copy from the same cluster to our internal Object Store (not AWS S3, but S3-like) took 4 minutes for 98.6 GB.
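To put the two timings side by side, the effective throughput can be worked out from the figures quoted above. This is a rough sketch: the helper name `throughput_mbps` is hypothetical, and 1 GB is taken as 1024 MB.

```shell
# Hypothetical helper: throughput_mbps GB SECONDS -> MB/s
throughput_mbps() {
  awk -v gb="$1" -v s="$2" 'BEGIN { printf "%.1f\n", gb * 1024 / s }'
}

throughput_mbps 98.6 240     # internal Object Store, 98.6 GB in 4 min -> 420.7
throughput_mbps 35.6 7200    # S3, 35.6 GB in 2 h                      -> 5.1
throughput_mbps 35.6 10800   # S3, 35.6 GB in 3 h                      -> 3.4
```

That is roughly a hundredfold gap between the two targets, which points to per-file overhead rather than cluster capacity as the bottleneck.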

Command:

hadoop distcp -Dmapreduce.task.timeout=0 -Dfs.s3a.fast.upload=true -Dfs.s3a.fast.buffer.size=157286400 -Dfs.s3a.multipart.size=314572800 -Dfs.s3a.multipart.threshold=1073741824 -Dmapreduce.map.memory.mb=8192 -Dmapreduce.map.java.opts=-Xmx7290m -Dfs.s3a.max.total.tasks=1 -Dfs.s3a.threads.max=10 -bandwidth 1024 /abc/xyz/ava/ s3a://bucket-name/Data/

Which of these values could be tuned to speed up the copy?

My cluster specs are as follows:

Allocated memory (Cumulative) - 1.2T

Available memory - 5.9T

Allocated VCores (Cumulative) - 119T

Available VCores - 521T

Configured Capacity - 997T

HDFS Used - 813T

Non-HDFS Used - 2.7T

Can anyone suggest a way to overcome this issue, and an optimal distcp configuration for the 800 GB - 1 TB of files we usually transfer from HDFS to Object Store?

BigData-Guru · Accepted Answer · 2020-06-30T14:34:32

In my project we copied 20 TB to S3A with DistCp. At first it took more than 24 hours. After adding two new buckets and running the same DistCp command against them, the copy time dropped to about 16 hours.
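As a rough illustration of spreading a copy over several buckets, the dry-run sketch below prints one distcp command per source subdirectory and bucket pair. It is not the exact setup used in that project: the subdirectory names, the bucket names and the `-m 50` map count are all hypothetical placeholders.

```shell
#!/usr/bin/env bash
# Dry run: only prints the distcp commands, it does not launch them.
set -euo pipefail

SRC_ROOT="/abc/xyz/ava"
SUBDIRS=(part1 part2 part3)                           # hypothetical split of the source tree
BUCKETS=(bucket-name-1 bucket-name-2 bucket-name-3)   # hypothetical target buckets

build_cmd() {
  # $1 = source subdirectory, $2 = target bucket
  # -m caps the number of map tasks; 50 is an arbitrary example value.
  echo "hadoop distcp -m 50 ${SRC_ROOT}/$1 s3a://$2/Data/"
}

for i in "${!SUBDIRS[@]}"; do
  build_cmd "${SUBDIRS[$i]}" "${BUCKETS[$i]}"
done
```

To actually run the jobs side by side, each printed command could be started in the background with `&` and followed by a final `wait`, provided the cluster has capacity for the concurrent map tasks.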

Another option is to increase the number of vcores in the cluster.
