0
votes

Google has a Cloud Storage Data Transfer option to copy from one bucket to another, but this only works if both buckets are in the same project. Running gsutil -m rsync -r -d from cron is an easy option, but we are migrating all of our bash to Python 3. So I need a Python 3 script, to run as a Google Cloud Function, that does a weekly copy of the whole bucket from project1 to another bucket in project2.

Language: python 3
app     : Cloud Function
Process : Copy one bucket to another
Source Project: project1
Source bucket : bucket1
Dest Project: project2
Dest Bucket: bucket2
pseudo cmd: gsutil rsync -r gs://bucket1 gs://bucket2

Any quick and readable Python 3 script to do that?
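Something along the lines of this rough, untested sketch using the google-cloud-storage client is what I have in mind (bucket names are the placeholders from above; the function's service account would need read access on bucket1 and write access on bucket2):

# rough, untested sketch of the kind of Cloud Function I have in mind
# assumes the function's service account can read bucket1 and write bucket2
from google.cloud import storage


def copy_bucket(request):  # HTTP-triggered Cloud Function entry point
    client = storage.Client()
    src_bucket = client.bucket("bucket1")  # source bucket in project1
    dst_bucket = client.bucket("bucket2")  # destination bucket in project2

    # copy every object, keeping the same object name
    for blob in client.list_blobs(src_bucket):
        src_bucket.copy_blob(blob, dst_bucket, blob.name)

    return "done"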


2 Answers

0
votes

A Python script to do this will be really slow. I would use a Dataflow (Apache Beam) batch process to do this. You can code it in Python 3 easily.

Basically you need:

  • One operation to list all the files.
  • One Reshuffle() operation to distribute the load among several workers.
  • One operation to actually copy each file from source to destination.

The good part is that Google will scale the workers for you and it won't take much time. You'll be billed for the storage operations and for the gigabytes + CPU it takes to move all the data.
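A minimal sketch of such a pipeline, assuming the Apache Beam Python SDK (apache-beam[gcp]) and that the job's service account can read the source bucket and write the destination bucket (bucket names and the copy_object helper are placeholders):

import apache_beam as beam
from apache_beam.io import fileio
from apache_beam.io.gcp.gcsio import GcsIO
from apache_beam.options.pipeline_options import PipelineOptions

SRC = 'gs://bucket1'  # source bucket (placeholder)
DST = 'gs://bucket2'  # destination bucket (placeholder)


def copy_object(metadata):
    # copy a single object from the source bucket to the destination bucket
    dst_path = metadata.path.replace(SRC, DST, 1)
    GcsIO().copy(metadata.path, dst_path)
    return dst_path


def run(argv=None):
    # pass --runner=DataflowRunner --project=... --region=... --temp_location=gs://... on the command line
    options = PipelineOptions(argv)
    with beam.Pipeline(options=options) as p:
        (p
         # 1. list every object in the source bucket
         | 'ListSource' >> fileio.MatchFiles(SRC + '/**')
         # 2. reshuffle to spread the copies across workers
         | 'Shuffle' >> beam.Reshuffle()
         # 3. copy each object to the destination bucket
         | 'Copy' >> beam.Map(copy_object))


if __name__ == '__main__':
    run()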

0
votes

Rsync is not an operation that can be performed via a single request in the Cloud Storage REST API, and gsutil is not available on Cloud Functions, so rsyncing both buckets via a Python script is not possible.

You can create a function that starts a preemptible VM with a startup script that executes the rsync between buckets and shuts down the instance after the rsync operation finishes.

By using a VM instead of a serverless service you avoid any timeout that could be triggered by a long rsync process.

A preemptible VM can run for up to 24 hours before being stopped, and you will only be charged for the time the instance is turned on (disk storage is charged independently of the instance's status).

If the VM runs for less than a minute, you won't be charged for the usage.

For this approach, you first need to create a bash script in a bucket; it will be executed by the preemptible VM at startup time. For example:

#! /bin/bash
gsutil rsync -r gs://mybucket1 gs://mybucket2

sudo init 0 # this is similar to poweroff, halt or shutdown -h now

After that, you need to create a preemptible VM with a startup script. I recommend an f1-micro instance, since the rsync command between buckets doesn't require many resources.

1. Go to the VM instances page.

2. Click Create instance.

3. On the Create a new instance page, fill in the properties for your instance.

4. Click Management, security, disks, networking, sole tenancy.

5. In the Identity and API access section, select a service account that has access to read your startup script file in Cloud Storage and the buckets to be synced.

6. Select Allow full access to all Cloud APIs.

7. Under Availability policy, set the Preemptibility option to On. This setting disables automatic restart for the instance and sets the host maintenance action to Terminate.

8. In the Metadata section, provide startup-script-url as the metadata key.

9. In the Value box, provide a URL to the startup script file, in either the gs://BUCKET/FILE or https://storage.googleapis.com/BUCKET/FILE format.

10. Click Create to create the instance.

With this configuration, every time your instance is started the script will also be executed.
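If you prefer to create the preemptible instance from code instead of the console, a rough sketch using the same googleapiclient library would look like this (project, zone, instance name, service account and image are placeholder values, not tested):

def create_preemptible_vm(project, zone, name, startup_script_url):
    # creates an f1-micro preemptible VM that runs the startup script at boot
    from googleapiclient import discovery
    from oauth2client.client import GoogleCredentials

    credentials = GoogleCredentials.get_application_default()
    compute = discovery.build('compute', 'v1', credentials=credentials, cache_discovery=False)

    config = {
        'name': name,
        'machineType': 'zones/{}/machineTypes/f1-micro'.format(zone),
        # Preemptibility "On": no automatic restart, terminate on host maintenance
        'scheduling': {'preemptible': True, 'automaticRestart': False, 'onHostMaintenance': 'TERMINATE'},
        'disks': [{
            'boot': True,
            'autoDelete': True,
            'initializeParams': {'sourceImage': 'projects/debian-cloud/global/images/family/debian-11'},
        }],
        'networkInterfaces': [{
            'network': 'global/networks/default',
            'accessConfigs': [{'type': 'ONE_TO_ONE_NAT', 'name': 'External NAT'}],
        }],
        # service account with access to the startup script and to both buckets (placeholder email)
        'serviceAccounts': [{
            'email': 'my-sa@yourprojectID.iam.gserviceaccount.com',
            'scopes': ['https://www.googleapis.com/auth/cloud-platform'],
        }],
        'metadata': {'items': [{'key': 'startup-script-url', 'value': startup_script_url}]},
    }
    return compute.instances().insert(project=project, zone=zone, body=config).execute()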

This is the Python function to start a VM (regardless of whether it is preemptible):

def power(request):
    import logging
    # these libraries are required to reach the Compute Engine API
    from googleapiclient import discovery
    from oauth2client.client import GoogleCredentials

    # the function will use the service account of your Cloud Function
    credentials = GoogleCredentials.get_application_default()

    # this line specifies the API we are going to use, in this case Compute Engine
    service = discovery.build('compute', 'v1', credentials=credentials, cache_discovery=False)

    # set correct log level (to avoid noise in the logs)
    logging.getLogger('googleapiclient.discovery_cache').setLevel(logging.ERROR)

    # Project ID for this request.
    project = "yourprojectID"  # Update placeholder value.
    zone = "us-central1-a"  # update this to the zone of your vm
    instance = "myvm"  # update with the name of your vm

    response = service.instances().start(project=project, zone=zone, instance=instance).execute()

    print(response)
    return ("OK")

requirements.txt file

google-api-python-client
oauth2client
flask

And you can schedule your function with Cloud Scheduler:

  1. Create a service account with the functions.invoker permission on your function.
  2. Create a new Cloud Scheduler job.
  3. Specify the frequency in cron format.
  4. Specify HTTP as the target type.
  5. Add the URL of your Cloud Function and the HTTP method, as usual.
  6. Select OIDC token from the Auth header dropdown.
  7. Add the service account email in the Service account text box.
  8. In the Audience field you only need to write the URL of the function, without any additional parameters.
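
If you prefer to create the Cloud Scheduler job from Python instead of the console, a rough sketch with the google-cloud-scheduler client library (job name, schedule, URLs and service account are placeholder values, not tested) could look like this:

from google.cloud import scheduler_v1


def create_weekly_job(project, location, function_url, invoker_sa_email):
    # creates a weekly Cloud Scheduler job that calls the function with an OIDC token
    client = scheduler_v1.CloudSchedulerClient()
    parent = 'projects/{}/locations/{}'.format(project, location)
    job = scheduler_v1.Job(
        name='{}/jobs/weekly-bucket-sync'.format(parent),
        schedule='0 3 * * 1',  # every Monday at 03:00
        time_zone='Etc/UTC',
        http_target=scheduler_v1.HttpTarget(
            uri=function_url,
            http_method=scheduler_v1.HttpMethod.POST,
            oidc_token=scheduler_v1.OidcToken(
                service_account_email=invoker_sa_email,
                audience=function_url,  # the function URL without any additional parameters
            ),
        ),
    )
    return client.create_job(parent=parent, job=job)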

On Cloud Scheduler, I hit my function by using this URL:

https://us-central1-yourprojectID.cloudfunctions.net/power

and I used this audience:

https://us-central1-yourprojectID.cloudfunctions.net/power

Please replace yourprojectID in the code and in the URLs, and the zone us-central1, with your own values.