I want to rsync a bucket containing 100M files from S3 to GCS. I'm running on a c3.8xlarge instance and did a quick dry run:

$ time gsutil -m rsync -r -n s3://s3-bucket/ gs://gs-bucket/
Building synchronization state...
At source listing 10000...
^C

real    4m11.946s
user    0m0.560s
sys     0m0.268s

That's about 4 minutes for 10k files. At this rate, it will take roughly 27 days just to compute the sync state. Is there anything I can do to speed this up?
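For reference, the 27-day figure is a simple extrapolation. A rough sketch, assuming the listing rate stays constant at the observed 10,000 objects per 4 minutes (rounded from the 4m12s wall-clock time above):

```shell
# Extrapolate the listing time from the dry run (all inputs are rough assumptions).
total_objects=100000000   # ~100M objects in the source bucket
batch_objects=10000       # objects listed before the run was interrupted
batch_seconds=240         # ~4 minutes of wall-clock time for that batch

total_seconds=$(( total_objects / batch_objects * batch_seconds ))
days=$(( total_seconds / 86400 ))
hours=$(( total_seconds % 86400 / 3600 ))

echo "Estimated listing time: ${days} days ${hours} hours"
```

With these inputs the estimate comes out at 27 days and 18 hours, before any data is copied.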

I also noticed (and fixed) the following warning:

WARNING: gsutil rsync uses hashes when modification time is not available at both the source and destination. Your crcmod installation isn't using the module's C extension, so checksumming will run very slowly. If this is your first rsync since updating gsutil, this rsync can take significantly longer than usual. For help installing the extension, please see "gsutil help crcmod".

Are the file hashes being computed during this step, or am I just waiting for the listing of 100M files?

1 Answer

When setting up a sync process between two buckets, the first iteration is going to be the slowest because it needs to copy all of the data from the source bucket to the destination bucket. For cross-provider syncs, this is slowed down further by the need for two separate connections per object: one to pull the data from the source to the host machine, and another to push it from the host to the destination (gsutil refers to this as "daisy-chain" mode).

For the initial sync between buckets (and possibly subsequent syncs as well), you might be better off using Google Cloud's Storage Transfer Service, which copies the objects on your behalf. This tends to be much faster than doing all the work on one machine running gsutil.
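As a rough sketch, a transfer job for the buckets in the question could be created with the gcloud CLI along these lines. The bucket names come from the question; aws-creds.json is a hypothetical file name for a JSON file holding your AWS access key ID and secret access key. Check "gcloud transfer jobs create --help" for the options your gcloud version supports before relying on this.

```shell
# Assumptions: bucket names are taken from the question; aws-creds.json is a
# hypothetical path to a JSON file with the AWS credentials for the source bucket.
SRC="s3://s3-bucket"
DST="gs://gs-bucket"
CREDS_FILE="aws-creds.json"

# Build the command as a string first so it can be inspected before anything is created.
cmd="gcloud transfer jobs create $SRC $DST --source-creds-file=$CREDS_FILE"
echo "$cmd"

# Uncomment to actually create the transfer job:
# $cmd
```

The script only prints the command by default, so you can review it (and the credentials path) before creating a job that moves 100M objects.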

As for the warning, it's a general warning that's printed at the beginning of the command execution if you don't have the crcmod C extension installed, regardless of what's present at the destination.