I am trying to figure out whether I can use gsutil's cp command on Windows to upload files to Google Cloud Storage. I have 6 folders on my local computer that get new PDF documents added to them daily. Each folder contains around 2,500 files, and all of them are already on Google Cloud Storage in their respective folders. Right now I mainly upload the new files using Google Cloud Storage Manager. Is there a way to create a batch file, scheduled to run automatically every night, that grabs only the files scanned today and uploads them to Google Storage?

I tried this format:

python c:\gsutil\gsutil cp "E:\PIECE POs\64954.pdf" "gs://dompro/piece pos" 

and it uploaded the file perfectly fine.

This command

python c:\gsutil\gsutil cp "E:\PIECE POs\*.pdf" "gs://dompro/piece pos" 

will upload all of the files into the bucket. But how do I grab only the files that were changed or created today? Is there a way to do that?

2 Answers

One solution would be to use the -n parameter on the gsutil cp command:

python c:\gsutil\gsutil cp -n "E:\PIECE POs\*" "gs://dompro/piece pos/"

That will skip any files that already exist in the bucket. You may also want to try gsutil's top-level -m flag, which parallelizes copies and may speed the transfer up for you:

python c:\gsutil\gsutil -m cp -n "E:\PIECE POs\*" "gs://dompro/piece pos/"

Since you have Python available to you, you could write a small Python script that finds the ctime (creation time, on Windows) or mtime (modification time) of each file in a directory, checks whether that date is today, and uploads the file if so. You can see an example in this question, which could be adapted as follows:

import datetime
import os

local_path_to_storage_bucket = [
    ('<local-path-1>', 'gs://bucket1'),
    ('<local-path-2>', 'gs://bucket2'),
    # ... add more here as needed
]

today = datetime.date.today()
for local_path, storage_bucket in local_path_to_storage_bucket:
    for filename in os.listdir(local_path):
        # os.listdir returns bare file names, so join them with the
        # directory before checking timestamps or uploading.
        full_path = os.path.join(local_path, filename)
        ctime = datetime.date.fromtimestamp(os.path.getctime(full_path))
        mtime = datetime.date.fromtimestamp(os.path.getmtime(full_path))
        if today in (ctime, mtime):
            # Using the 'subprocess' library would be better, but this is
            # simpler to illustrate the example.
            os.system('gsutil cp "%s" "%s"' % (full_path, storage_bucket))
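As the comment above notes, subprocess is the safer way to shell out. A minimal sketch of that variant (the helper names files_changed_today and upload are illustrative, not part of gsutil):

```python
import datetime
import os
import subprocess

def files_changed_today(directory):
    """Return full paths of regular files created or modified today."""
    today = datetime.date.today()
    selected = []
    for name in os.listdir(directory):
        path = os.path.join(directory, name)
        if not os.path.isfile(path):
            continue  # skip subdirectories
        ctime = datetime.date.fromtimestamp(os.path.getctime(path))
        mtime = datetime.date.fromtimestamp(os.path.getmtime(path))
        if today in (ctime, mtime):
            selected.append(path)
    return selected

def upload(paths, storage_bucket):
    for path in paths:
        # Passing an argument list avoids shell-quoting problems with
        # spaces in paths like "E:\PIECE POs".
        subprocess.call(["gsutil", "cp", path, storage_bucket])
```

On Windows you may need to invoke it as `["python", r"c:\gsutil\gsutil", "cp", ...]`, matching how gsutil is run elsewhere in this question.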

Alternatively, consider using the Google Cloud Storage Python client library directly instead of shelling out to gsutil.
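As a sketch of that approach, assuming the google-cloud-storage package is installed and application credentials are configured (the helper names upload_pdf and blob_name_for are illustrative):

```python
import os

def blob_name_for(local_path, prefix=""):
    """Build the object name, mirroring gsutil's behaviour of copying
    just the base file name into the destination path."""
    name = os.path.basename(local_path)
    return "%s/%s" % (prefix.rstrip("/"), name) if prefix else name

def upload_pdf(local_path, bucket_name, prefix=""):
    """Upload one file using the google-cloud-storage client library."""
    from google.cloud import storage  # deferred import; needs credentials
    client = storage.Client()
    bucket = client.bucket(bucket_name)
    bucket.blob(blob_name_for(local_path, prefix)).upload_from_filename(local_path)
```

For example, upload_pdf(r"E:\PIECE POs\64954.pdf", "dompro", "piece pos") would upload the file as the object "piece pos/64954.pdf" in the dompro bucket.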