
How to copy a few terabytes of data from GCS to S3?

There's a nice "Transfer" feature in GCS that allows importing data from S3 into GCS. But how do you do the export, the other way around (besides moving the data generation jobs to AWS)?

Q: Why not gsutil? Yes, gsutil supports s3://, but the transfer is limited by the network throughput of the machine it runs on. How can this be done in parallel more easily?

I tried Dataflow (now Apache Beam), which would work fine because it's easy to parallelize across, say, a hundred nodes, but I don't see a simple "just copy it from here to there" function.

UPDATE: Also, Beam seems to compute the list of source files on the local machine, in a single thread, before starting the pipeline. In my case that takes around 40 minutes. It would be nice to distribute that step in the cloud as well.
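For reference, here is a minimal sketch of what such a Beam (Python SDK) pipeline could look like, assuming apache-beam with the GCP and AWS extras installed, S3 credentials supplied via pipeline options, and placeholder bucket names (src-bucket, dst-bucket). Matching the file pattern inside the pipeline, rather than on the submitting machine, also addresses the single-threaded local listing:

```python
import apache_beam as beam
from apache_beam.io import fileio
from apache_beam.io.filesystems import FileSystems
from apache_beam.options.pipeline_options import PipelineOptions


class CopyObject(beam.DoFn):
    """Streams one object from its GCS path to the corresponding S3 path."""

    CHUNK = 16 * 1024 * 1024  # 16 MiB per read; placeholder value

    def process(self, metadata):
        src = metadata.path  # e.g. gs://src-bucket/dir/file
        dst = src.replace('gs://src-bucket/', 's3://dst-bucket/', 1)  # placeholder mapping
        with FileSystems.open(src) as reader, FileSystems.create(dst) as writer:
            while True:
                chunk = reader.read(self.CHUNK)
                if not chunk:
                    break
                writer.write(chunk)
        yield dst


def run():
    # Runner, project and S3 credential flags are expected on the command line.
    options = PipelineOptions()
    with beam.Pipeline(options=options) as p:
        (p
         | 'MatchSource' >> fileio.MatchFiles('gs://src-bucket/**')  # listing runs inside the pipeline
         | 'Reshuffle' >> beam.Reshuffle()                           # spread matched files across workers
         | 'Copy' >> beam.ParDo(CopyObject()))


if __name__ == '__main__':
    run()
```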

UPDATE 2: So far I'm inclined to write two scripts of my own (a rough sketch of both is included below):

  • Script A: lists all objects to transfer and enqueues a transfer task for each one into a Pub/Sub queue.
  • Script B: executes these transfer tasks. Runs in the cloud (e.g. on Kubernetes), with many instances in parallel.

The drawback is that this means writing code that may contain bugs, etc., instead of using a built-in solution like the GCS "Transfer" service.
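A rough sketch of the two scripts, assuming google-cloud-storage (recent enough to have Blob.open), google-cloud-pubsub and boto3, with placeholder project, topic, subscription and bucket names:

```python
import json

import boto3
from google.cloud import pubsub_v1, storage

PROJECT = 'my-project'             # placeholder
TOPIC = 'gcs-to-s3-tasks'          # placeholder
SUBSCRIPTION = 'gcs-to-s3-worker'  # placeholder
SRC_BUCKET = 'src-bucket'          # placeholder
DST_BUCKET = 'dst-bucket'          # placeholder


def script_a_enqueue():
    """Script A: list every object in the source bucket and publish one task per object."""
    publisher = pubsub_v1.PublisherClient()
    topic_path = publisher.topic_path(PROJECT, TOPIC)
    for blob in storage.Client().list_blobs(SRC_BUCKET):
        publisher.publish(topic_path, json.dumps({'name': blob.name}).encode('utf-8'))


def script_b_worker():
    """Script B: pull tasks and copy each object from GCS to S3; run many replicas."""
    subscriber = pubsub_v1.SubscriberClient()
    sub_path = subscriber.subscription_path(PROJECT, SUBSCRIPTION)
    gcs = storage.Client()
    s3 = boto3.client('s3')

    def handle(message):
        name = json.loads(message.data.decode('utf-8'))['name']
        # Stream the object so large files are not held in memory; for very large
        # objects the Pub/Sub ack deadline would need attention.
        with gcs.bucket(SRC_BUCKET).blob(name).open('rb') as reader:
            s3.upload_fileobj(reader, DST_BUCKET, name)
        message.ack()

    streaming_pull = subscriber.subscribe(sub_path, callback=handle)
    streaming_pull.result()  # block forever; the orchestrator restarts the pod on failure


if __name__ == '__main__':
    script_b_worker()
```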


1 Answer


You could use gsutil running on Compute Engine (or EC2) instances (which may have higher network bandwidth available than your local machine). Using gsutil -m cp will parallelize copying across objects, but individual objects will still be copied sequentially.
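For example (with hypothetical bucket names), running something like `gsutil -m cp -r gs://source-bucket/prefix s3://dest-bucket/prefix` on several Compute Engine VMs, each assigned a different prefix, would spread the load across machines as well as across objects.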