2 votes

I am working on a TensorFlow model where I want to make use of the latest uploaded object, in order to get output from that uploaded object. Is there a way to access the latest object uploaded to a Google Cloud Storage bucket using Python?


2 Answers

3 votes

Below is what I use for grabbing the most recently updated object.

Instantiate your client

from google.cloud import storage
# first establish your client
storage_client = storage.Client()

Define bucket_name and any additional paths via prefix

# get your blobs
bucket_name = 'your-glorious-bucket-name'
prefix = 'special-directory/within/your/bucket' # optional

Iterate the blobs returned by the client

Storing these as tuple records is quick and efficient.

blobs = [(blob, blob.updated) for blob in storage_client.list_blobs(
    bucket_name,
    prefix=prefix,
)]

Sort the list on the second tuple value

# sort and grab the latest value, based on the updated key
latest = sorted(blobs, key=lambda tup: tup[1])[-1][0]
string_data = latest.download_as_string()

See the metadata key docs and the Google Cloud Storage Python client docs for details.

One-liner

# assumes storage_client as above
# `latest` holds the blob's data (bytes, despite the method name)
latest = sorted([(blob, blob.updated) for blob in storage_client.list_blobs(bucket_name, prefix=prefix)], key=lambda tup: tup[1])[-1][0].download_as_string()
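If you only need the single newest blob, max() gives the same result without building and sorting the full list:

# same assumptions as above
latest = max(storage_client.list_blobs(bucket_name, prefix=prefix), key=lambda blob: blob.updated)
string_data = latest.download_as_string()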
1 vote

There is no direct way to get the latest uploaded object from Google Cloud Storage. However, there is a workaround using the object's metadata.

Every object uploaded to Google Cloud Storage has metadata. For more information you can visit the Cloud Storage > Object Metadata documentation. One of the metadata fields is "Last updated". Its value is a timestamp of the last time the object was updated, which can happen on only three occasions:

A) The object was uploaded for the first time.

B) The object was uploaded and replaced because it already existed.

C) The object's metadata changed.
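
Reading that timestamp with the Python client is straightforward; here is a minimal sketch, where the bucket and object names are placeholders:

from google.cloud import storage

storage_client = storage.Client()
bucket = storage_client.bucket('your-bucket-name')  # placeholder name
# get_blob fetches the object's metadata, including `updated`
blob = bucket.get_blob('path/to/object')  # placeholder name
print(blob.updated)  # datetime of the last upload, replace, or metadata change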

If you are not updating the metadata of the object, then you can use this workaround (a minimal sketch follows the list):

  1. Set a variable to a very old datetime object (1900-01-01 00:00:00.000000). No real object will have this as its "updated" timestamp.
  2. Set a variable to store the latest blob's name and initialize it to None.
  3. List all the blobs in the bucket (Google Cloud Storage documentation).
  4. For each blob, load the "updated" metadata and convert it to a datetime object.
  5. If the blob's "updated" value is greater than the one you have stored, update the stored timestamp and save the current blob's name.
  6. This process continues until you have searched all the blobs, and only the latest one remains in the variables.
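
A minimal sketch of these steps (the bucket name is a placeholder; note that blob.updated from the Python client is already a timezone-aware datetime, so the sentinel date must be timezone-aware too):

from datetime import datetime, timezone
from google.cloud import storage

storage_client = storage.Client()

# Step 1: a sentinel date so old that no real object can precede it.
latest_time = datetime(1900, 1, 1, tzinfo=timezone.utc)
# Step 2: no latest blob found yet.
latest_name = None

# Steps 3-6: scan every blob, keeping the most recently updated one.
for blob in storage_client.list_blobs('your-bucket-name'):
    if blob.updated > latest_time:
        latest_time = blob.updated
        latest_name = blob.name

print(latest_name, latest_time)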

I did a bit of coding myself, and this is my GitHub code example that worked for me. Take the logic and modify it based on your needs. I would also suggest testing it locally before using it in your code.

BUT, in case you update the blob's metadata manually, there is another workaround:

If you update any of the blob's metadata (see the Viewing and Editing Object Metadata documentation), then the "Last updated" timestamp of that blob also gets updated, so running the above method will NOT give you the last uploaded object but the last modified one, which is different. Therefore, you can add a custom metadata field to your object every time you upload it, and that custom metadata will hold the timestamp at the time of upload. No matter what happens to the other metadata later, the custom field will always keep the time the object was uploaded. Then use the same method as above, but instead of reading blob.updated, read blob.metadata and apply the same logic to that date.
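
A sketch of this approach with the Python client (the key name 'uploaded', the bucket name, and the file paths are illustrative choices, not anything the API requires):

from datetime import datetime, timezone
from google.cloud import storage

storage_client = storage.Client()
bucket = storage_client.bucket('your-bucket-name')  # placeholder

# On upload: stamp the object with a custom metadata key holding the upload time.
blob = bucket.blob('path/to/object')
blob.metadata = {'uploaded': datetime.now(timezone.utc).isoformat()}
blob.upload_from_filename('local-file.txt')

# Later: find the most recently *uploaded* blob, ignoring later metadata edits.
def upload_time(b):
    # Blobs without the custom key sort as oldest.
    ts = (b.metadata or {}).get('uploaded')
    if ts is None:
        return datetime(1900, 1, 1, tzinfo=timezone.utc)
    return datetime.fromisoformat(ts)

latest = max(storage_client.list_blobs('your-bucket-name'), key=upload_time)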

Additional notes:

To use custom metadata you need to use the prefix x-goog-meta-, as stated in the "Editing object metadata" section of the Viewing and Editing Object Metadata documentation.

So the [CUSTOM_METADATA_KEY] should be something like x-goog-meta-uploaded and the [CUSTOM_METADATA_VALUE] should be [CURRENT_TIMESTAMP_DURING_UPLOAD].
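
As far as I know, that prefix applies when you set metadata through raw request headers (for example with the XML API or gsutil). With the Python client, a sketch like the following passes the bare key and the library handles the wire format (names are placeholders):

from google.cloud import storage

storage_client = storage.Client()
bucket = storage_client.bucket('your-bucket-name')  # placeholder

blob = bucket.get_blob('path/to/object')  # placeholder
blob.metadata = {'uploaded': '2020-01-01T00:00:00+00:00'}  # bare key, no x-goog-meta- prefix
blob.patch()  # persist the metadata change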