2 votes

Google Cloud provides a connector for working with Hadoop (https://cloud.google.com/hadoop/google-cloud-storage-connector).

Using the connector, I am transferring data from HDFS to Google Cloud Storage.

For example:

hadoop distcp hdfs://${path} gs://${path}

But the data is large (16 TB) and the transfer speed is only about 2 MB/s.

So I tried tuning the distcp settings (the map property, the bandwidth property, ...).
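For example, something like the following (the values here are only illustrative; -m and -bandwidth are standard DistCp options):

# -m sets the number of map tasks (parallel copies);
# -bandwidth caps each map's throughput in MB/s.
hadoop distcp -m 100 -bandwidth 100 hdfs://${path} gs://${path}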

However, the speed stays the same.

How can I speed up distcp when transferring data from HDFS to Google Cloud Storage?

Do you know if Google has any parameters like Amazon's fs.s3a.fast.upload=true? - AM_Hawk
Google Cloud does not have parameters like Amazon's. - Lee. YunSu
Generally, to achieve maximum throughput you need to set the number of mappers to the number of files you copy: hadoop distcp -m <NUMBER_OF_FILES> hdfs://${path} gs://${path} - Igor Dvorzhak
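To apply Igor's suggestion, you can look up the file count first (a sketch; hadoop fs -count prints DIR_COUNT, FILE_COUNT, CONTENT_SIZE, and PATHNAME, so the second column is the number of files):

# The second column of the output is the file count to pass to -m:
hadoop fs -count hdfs://${path}

hadoop distcp -m <NUMBER_OF_FILES> hdfs://${path} gs://${path}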

1 Answer

3 votes

The official documentation states that one of the best options for transferring data from on-premises clusters to GCP is to use a VPN tunnel over the internet, or even multiple VPN tunnels for additional bandwidth.

Other proposed options are direct peering between Google's edge points of presence (PoPs) and your network, or a direct connection to Google's network through a Cloud Interconnect service provider.
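As an illustration of the VPN route, below is a minimal sketch of creating a single Classic VPN tunnel with gcloud; the gateway name, region, peer address, traffic selectors, and shared secret are all placeholders, and the target VPN gateway with its forwarding rules is assumed to already exist:

# Placeholder names and addresses; a real setup also needs the VPN
# gateway, forwarding rules, and routes configured beforehand.
gcloud compute vpn-tunnels create onprem-tunnel-1 \
    --region=us-central1 \
    --target-vpn-gateway=onprem-gw \
    --peer-address=203.0.113.1 \
    --shared-secret=REPLACE_WITH_SECRET \
    --local-traffic-selector=10.0.0.0/8 \
    --remote-traffic-selector=192.168.0.0/16 \
    --ike-version=2

# Additional tunnels (onprem-tunnel-2, ...) can be created the same
# way to aggregate bandwidth across parallel tunnels.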