0 votes

I have a very big folder in Google Cloud Storage, and I am currently deleting it with the following Django/Python code on Google App Engine, within the 30-second default HTTP timeout.

def deleteStorageFolder(bucketName, folder):
    import logging
    from google.cloud import storage
    cloudStorageClient = storage.Client()
    bucket = cloudStorageClient.bucket(bucketName)
    logging.info("Deleting : " + folder)
    try:
        # Lists every object under the prefix and deletes them one by one
        bucket.delete_blobs(blobs=bucket.list_blobs(prefix=folder))
    except Exception as e:
        logging.info(str(e))

It is really unbelievable that Google Cloud expects the application to request the information for the objects inside the folder one by one and then delete them one by one.

Obviously, this fails due to the timeout. What would be the best strategy here?

(There should be a way to delete the parent object in the bucket so that all of the associated child objects are deleted somewhere in the background, while we remove the associated data from our model. Google Storage would then be free to delete the data whenever it wants. Yet, per my understanding, this is not how things are implemented.)

Google Cloud Storage does not have folders. The concept of parent/child is emulated in software. You could list the objects in a folder and then spin off threads to delete each object in parallel. - John Hanley
Yes, Google Cloud Storage has the concept of objects for files and directories. In Google App Engine, by design, you have a 30-second timeout for a Django application. Creating threads to delete each object in parallel will not solve this, as it will still hit the timeout, but I think I can do this in Google App Engine Flex. - SuperEye
The namespace for Google Cloud Storage is FLAT. Directories do not exist. Everything is an object in the root folder. The HTTP request timeout for App Engine is 10 minutes, so the 30-second limitation is in your application. - John Hanley
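
For reference, a minimal sketch of the parallel-delete approach suggested in the comments, assuming the standard google-cloud-storage client; the function name and worker count are illustrative, and as noted above this may still exceed a 30-second request deadline:

from concurrent.futures import ThreadPoolExecutor
from google.cloud import storage

def deleteStorageFolderParallel(bucketName, folder, max_workers=32):
    client = storage.Client()
    bucket = client.bucket(bucketName)
    blobs = list(bucket.list_blobs(prefix=folder))
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # map submits one delete request per object; list() waits for completion
        # and surfaces any errors
        list(executor.map(lambda blob: blob.delete(), blobs))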

2 Answers

2 votes

Two simple options come to mind until the client library supports batch deletion (see https://issuetracker.google.com/issues/142641783):

  1. If the GAE image includes the gsutil CLI, you could execute gsutil -m rm ... in a subprocess (first sketch below)
  2. My favorite: use the gcsfs library instead of the Google client library. It supports batch deletion by default - see https://gcsfs.readthedocs.io/en/latest/_modules/gcsfs/core.html#GCSFileSystem.rm (second sketch below)
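
For option 1, a minimal sketch assuming the gsutil binary is available on the PATH inside the GAE image (the function name is illustrative):

import subprocess

def deleteStorageFolderWithGsutil(bucketName, folder):
    # -m runs the removals in parallel; -r removes everything under the prefix
    subprocess.run(
        ["gsutil", "-m", "rm", "-r", "gs://{}/{}".format(bucketName, folder)],
        check=True,
    )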
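For option 2, a minimal sketch using gcsfs, assuming default application credentials are available (the function name is illustrative):

import gcsfs

def deleteStorageFolderWithGcsfs(bucketName, folder):
    fs = gcsfs.GCSFileSystem()
    # recursive=True removes every object under the prefix using batched requests
    fs.rm("{}/{}".format(bucketName, folder), recursive=True)
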
1 vote

There is a workaround. You can do this in 2 steps:

  1. "Move" your file to delete into another bucket with Transfert enter image description here

Create a transfer from your bucket, with the filters that you want, to another bucket (create a temporary one if needed). Check the "delete from source after transfer" checkbox.
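
The same transfer can also be created programmatically. A rough sketch with the google-cloud-storage-transfer client; bucket names and the project ID are placeholders, and the field names follow my understanding of the Storage Transfer Service API, so double-check against the official docs:

from google.cloud import storage_transfer

def createDeleteTransfer(projectId, sourceBucket, tempBucket, folder):
    client = storage_transfer.StorageTransferServiceClient()
    job = client.create_transfer_job({
        "transfer_job": {
            "project_id": projectId,
            "status": storage_transfer.TransferJob.Status.ENABLED,
            "transfer_spec": {
                "gcs_data_source": {"bucket_name": sourceBucket},
                "gcs_data_sink": {"bucket_name": tempBucket},
                # Only transfer (and therefore delete) objects under the prefix
                "object_conditions": {"include_prefixes": [folder]},
                # Equivalent of the "delete from source after transfer" checkbox
                "transfer_options": {"delete_objects_from_source_after_transfer": True},
            },
        }
    })
    # Assumes a job with no schedule can be triggered manually
    client.run_transfer_job({"job_name": job.name, "project_id": projectId})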

  2. After the successful transfer, delete the temporary bucket. If that takes too long, you have another workaround:

    • Go to the bucket page
    • Click on "Lifecycle"
    • Set up a lifecycle rule that deletes files with age > 0 days (see the sketch after this list)
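
The same lifecycle rule can be set from code as well. A minimal sketch with the google-cloud-storage client; the bucket name is a placeholder:

from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket("my-temporary-bucket")
# age=0 marks every object for deletion as soon as the rule is evaluated
bucket.add_lifecycle_delete_rule(age=0)
bucket.patch()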

In both cases, you rely on Google Cloud's batch features, because doing it by yourself is far, far too slow!