2 votes

I have a bunch of storage buckets that publish notifications to a Pub/Sub topic when files get uploaded. Then I have a cloud function subscribed to the Pub/Sub topic that copies these files to their final destination buckets. This all works just fine for most files, but when I have large files (> 1GB) they fail to copy. The source buckets are multi-regional and the destination buckets are regional and nearline.
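For context, the function is a Pub/Sub-triggered background function; a simplified sketch of how it's wired up (the handler name and parsing are illustrative, not the exact code):

import base64
import json

def handle_gcs_notification(event, context):
    # Pub/Sub-triggered background function; the GCS notification's object
    # metadata arrives as base64-encoded JSON in event['data'].
    payload = json.loads(base64.b64decode(event['data']).decode('utf-8'))
    src_bucket_name = payload['bucket']
    src_filename = payload['name']
    # ... look up the destination and copy, as in the snippet below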

My code is essentially:

from google.cloud import storage

client = storage.Client()
src_bucket = client.get_bucket(src_bucket_name)
src_blob = src_bucket.get_blob(src_filename)
dst_bucket = client.get_bucket(dst_bucket_name)
dst_blob = dst_bucket.blob(dst_filename)

# Server-side copy of the source object into the destination bucket
dst_blob.rewrite(src_blob)

Initially, the cloud function was timing out at 60 seconds, so I assumed that was the issue, but even after bumping the cloud function timeout to 540 seconds the function still times out. I have the function retrying for 20 minutes, so I can see that the issue is repeatable. When raising the cloud function timeout didn't help, I read the blob docs and saw that blob.rewrite also has a default timeout of 60 seconds, so I bumped that up to 540 seconds as well, but it still times out.
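Concretely, the bumped rewrite call looks something like this (assuming a google-cloud-storage version recent enough to expose the timeout keyword):

# Raise the per-request timeout on the rewrite call itself
dst_blob.rewrite(src_blob, timeout=540)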

At this point, I am not sure what I am missing. Is this a timeout issue? Or does this have something to do with Pub/Sub publishing multiple messages so I could have multiple cloud functions trying to make the same copy simultaneously? Or is there a better way to move large files between buckets automatically?


2 Answers

5 votes

First, a bit on what's happening under the hood:

GCS's rewrite operation is an online operation: when the rewrite request reports success, the copy has been completed and the new object is ready. The downside is that the caller must hold the rewrite connection open while the copying is done, and a connection doesn't last forever. If the operation is going to take more than, say, 30 seconds or so, the rewrite request may return before the copy is complete. In that case it returns a rewrite token, which the client must pass back to resume the request, or else no further progress will be made.

In Python, that looks something like this:

# Start with an empty token; rewrite() returns None as the token once the
# copy is complete, so keep resuming until then.
rewrite_token = ''
while rewrite_token is not None:
  rewrite_token, bytes_rewritten, total_bytes = dst_blob.rewrite(
      src_blob, token=rewrite_token)
  print(f'Progress so far: {bytes_rewritten}/{total_bytes} bytes.')

This doesn't matter for smaller objects, or for copies where the service doesn't need to do any work to move the data around (for example, within the same location and storage class). For big cross-location or cross-storage-class operations like yours, though, you need to check whether resuming is needed.

That said, timing out is not what I'd expect to see from your code. That's a different sort of failure. Are you sure the error you're getting is a timeout?

-1 votes

I noticed that your code calls client.get_bucket in two places - the GCP docs for copying objects don't mention a get_bucket method: https://cloud.google.com/storage/docs/renaming-copying-moving-objects#storage-copy-object-python

(You'll have to click on the "Code Samples" tab, then choose "Python" to see what I'm talking about.)
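For reference, the linked sample does the copy roughly like this (a sketch; the bucket and object names are placeholders):

from google.cloud import storage

def copy_blob(bucket_name, blob_name, destination_bucket_name, destination_blob_name):
    # Server-side copy using Bucket.copy_blob, as in the docs sample
    storage_client = storage.Client()
    source_bucket = storage_client.bucket(bucket_name)
    source_blob = source_bucket.blob(blob_name)
    destination_bucket = storage_client.bucket(destination_bucket_name)
    blob_copy = source_bucket.copy_blob(source_blob, destination_bucket, destination_blob_name)
    print(f'Copied {source_blob.name} to {blob_copy.name}.')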

Does your code work locally in your own Python environment? Silly question, I know, but sometimes folks just go straight to the cloud. ¯\_(ツ)_/¯