1 vote

I'm trying to do a script to upload files to Google Cloud Storage. I've noticed that there are two ways for doing this:

a) Calling gsutil from Python with subprocess
b) Using the client library (from google.cloud import storage) and its native methods

What are the advantages and disadvantages of each method? Method (a) seems easier, but I don't know whether it has any drawbacks compared to method (b).

Thanks!

Example of (a)

import subprocess

filename = 'myfile.csv'
gs_bucket = 'my-bucket'
parallel_threshold = '150M'  # minimum size for parallel composite upload; 0 to disable

subprocess.check_call([
    'gsutil',
    '-o', 'GSUtil:parallel_composite_upload_threshold=%s' % (parallel_threshold,),
    'cp', filename, 'gs://%s/%s' % (gs_bucket, filename)
])

Example of (b)

from google.cloud import storage
def upload_blob(bucket_name, source_file_name, destination_blob_name):
    """Uploads a file to the bucket."""
    # bucket_name = "your-bucket-name"
    # source_file_name = "local/path/to/file"
    # destination_blob_name = "storage-object-name"

    storage_client = storage.Client()
    bucket = storage_client.bucket(bucket_name)
    blob = bucket.blob(destination_blob_name)

    blob.upload_from_filename(source_file_name)

    print(
        "File {} uploaded to {}.".format(
            source_file_name, destination_blob_name
        )
    )
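For completeness, the function above would then be called with your own bucket and object names; the names below are just placeholders:

upload_blob('my-bucket', 'myfile.csv', 'myfile.csv')  # bucket, local file, destination object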

1 Answer

2 votes

The bottom line: if both approaches work in your environment, it's largely a matter of preference, so pick whichever suits you best.

However, if you intend to run this code anywhere other than a machine where gsutil is already installed and configured, you will have problems: gsutil becomes an external dependency that you must install and authenticate on every host the script runs on.

If you want the code to be easier to move around, the client library is more predictable and should run anywhere there is an internet connection, assuming service account credentials are available to your code to initialize the SDK.
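As a minimal sketch of what that looks like (the key-file path and bucket/object names here are placeholders, not anything specific to your project):

from google.cloud import storage

# Path to a service account JSON key file (placeholder; adjust for your project).
key_path = 'path/to/service-account.json'

# Build the client from explicit credentials instead of relying on
# GOOGLE_APPLICATION_CREDENTIALS or gcloud being configured on the machine.
client = storage.Client.from_service_account_json(key_path)

bucket = client.bucket('my-bucket')        # placeholder bucket name
blob = bucket.blob('myfile.csv')           # destination object name
blob.upload_from_filename('myfile.csv')    # local file to upload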