I ran two distcp jobs, one copying 400+ GB and another copying 35.6 GB, and each of them took nearly 2-3 hours to complete.

We do have enough resources in the cluster.

But when I examined the container logs, I found that it takes a long time to copy small files. The file in the log excerpt below is one such small file.

2019-10-23 14:49:09,546 INFO [main] org.apache.hadoop.tools.mapred.CopyMapper: Copying hdfs://service-namemode-prod-ab/abc/xyz/ava/abc/hello/GRP_part-00001-.snappy.parquet to s3a://bucket-name/Data/abc/xyz/ava/abc/hello/GRP_part-00001-.snappy.parquet
2019-10-23 14:49:09,940 INFO [main] org.apache.hadoop.tools.mapred.RetriableFileCopyCommand: Creating temp file: s3a://bucket-name/Data/.distcp.tmp.attempt_1571566366604_9887_m_000010_0

So what can be done to make the distcp copy quicker?

Note: copying data from the same cluster to our internal object store (not AWS S3, but similar to S3) took 4 minutes for 98.6 GB.
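To put the two results side by side, here is a rough throughput calculation based only on the figures quoted above. It assumes 1 GB = 1024 MB and uses the 2 hour and 3 hour bounds for the 35.6 GB job; `throughput_mbps` is a helper name made up for this sketch.

```shell
# Average throughput in MB/s for copying <gigabytes> GB in <minutes> minutes.
# Assumption: 1 GB = 1024 MB.
throughput_mbps() {
  awk -v gb="$1" -v min="$2" 'BEGIN { printf "%.1f\n", gb * 1024 / (min * 60) }'
}

throughput_mbps 98.6 4     # internal object store  -> 420.7
throughput_mbps 35.6 120   # S3A copy, 2 hour bound -> 5.1
throughput_mbps 35.6 180   # S3A copy, 3 hour bound -> 3.4
```

On these numbers the S3A copy is roughly two orders of magnitude slower than the copy to the internal store.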

Command:

hadoop distcp -Dmapreduce.task.timeout=0 -Dfs.s3a.fast.upload=true -Dfs.s3a.fast.buffer.size=157286400 -Dfs.s3a.multipart.size=314572800 -Dfs.s3a.multipart.threshold=1073741824 -Dmapreduce.map.memory.mb=8192 -Dmapreduce.map.java.opts=-Xmx7290m -Dfs.s3a.max.total.tasks=1 -Dfs.s3a.threads.max=10 -bandwidth 1024 /abc/xyz/ava/ s3a://bucket-name/Data/
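For readability, the byte values passed in the command can be converted to MiB as below. This is only a small arithmetic sketch; `to_mib` is a helper name introduced here for illustration and is not part of Hadoop.

```shell
# Convert a byte count to whole MiB using integer arithmetic.
to_mib() {
  echo $(( $1 / 1024 / 1024 ))
}

to_mib 157286400    # fs.s3a.fast.buffer.size      -> 150
to_mib 314572800    # fs.s3a.multipart.size        -> 300
to_mib 1073741824   # fs.s3a.multipart.threshold   -> 1024
```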

Which of these values can be tuned?

My cluster specs are as follows:

Allocated memory (Cumulative) - 1.2T

Available memory - 5.9T

Allocated VCores(Cumulative) - 119T

Available VCores - 521T

Configured Capacity - 997T

HDFS Used - 813T

Non-HDFS Used - 2.7T

Can anyone suggest a solution to this issue, along with an optimal distcp configuration for the 800 GB - 1 TB of files we usually transfer from HDFS to the object store?

1 Answer

In my project we copied 20 TB to S3A with distcp, and it took more than 24 hours. However, after adding two new buckets and using the same distcp command, the copy time dropped to roughly 16 hours.
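One way to read this is that the source data was split across several destination buckets. The sketch below is a dry run under that assumption: it only prints one distcp command per source directory, assigning buckets round-robin, and runs nothing. The bucket names, the source sub-directories and the `plan_copies` helper are all hypothetical placeholders.

```shell
# Print (do not run) one distcp command per source directory,
# assigning destination buckets round-robin.
# $1 is a space-separated list of bucket names; the rest are source paths.
plan_copies() {
  buckets=$1
  shift
  # Count the buckets in the list.
  n=0
  for b in $buckets; do n=$(( n + 1 )); done
  i=0
  for src in "$@"; do
    idx=$(( i % n + 1 ))
    bucket=$(echo "$buckets" | cut -d' ' -f"$idx")
    echo "hadoop distcp $src s3a://$bucket/Data/"
    i=$(( i + 1 ))
  done
}

# Placeholder buckets and directories, for illustration only.
plan_copies "bucket-a bucket-b bucket-c" /abc/xyz/part1 /abc/xyz/part2 /abc/xyz/part3 /abc/xyz/part4
```

Each printed command could then be launched as its own job, with whatever `-D` options the original command used.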

Another option is to increase the number of vcores in the cluster.