3 votes

The issue I'm facing is that Cloud Storage lists files lexicographically (alphabetical order), while I'm reading the file at index 0 of the listing with the Python client library in a Cloud Function (using a Cloud Function is a must for my project) and loading its data into BigQuery. This works fine, but the newly added file does not always appear at index 0.

Streaming files arrive in my bucket every day at different times. The filename format is the same (data-2019-10-18T14_20_00.000Z-2019-10-18T14_25_00.txt), but the date and time fields in the name differ for every newly added file.

How can I adjust this Python code so that it reads the most recently added file in the Cloud Storage bucket every time the Cloud Function is triggered?

from google.cloud import storage

bucket = storage.Client().bucket('my-bucket')   # 'my-bucket' is a placeholder name
files = bucket.list_blobs()
fileList = [file.name for file in files if '.' in file.name]
blob = bucket.blob(fileList[0])   # reading the file placed at index 0 in the bucket listing

2 Answers

5 votes

If your Cloud Function is triggered by HTTP, you could substitute it with one that uses a Google Cloud Storage trigger. If it already is, you only need to take advantage of it.

Any time the function is triggered, you can check the event type and do whatever you need with the data, like:

from google.cloud import storage

storage_client = storage.Client()

def hello_gcs_generic(data, context):
    """Background Cloud Function to be triggered by Cloud Storage.
       See https://cloud.google.com/functions/docs/calling/storage#functions-calling-storage-python for more.
    """

    # For a background Cloud Function the event type is the string below;
    # storage.notification.OBJECT_FINALIZE_EVENT_TYPE is the Pub/Sub notification
    # attribute value ('OBJECT_FINALIZE') and would not match here.
    if context.event_type == 'google.storage.object.finalize':

        print('Created: {}'.format(data['timeCreated']))  # these prints are for illustration purposes
        print('Updated: {}'.format(data['updated']))

        blob = storage_client.get_bucket(data['bucket']).get_blob(data['name']) 

        #TODO whatever else needed with blob

This way you don't need to care about when the object was created: whenever one is created, your client library code fetches the corresponding blob and you do whatever you want with it.
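For example, a minimal sketch of that TODO step, assuming the uploaded .txt files contain UTF-8 text you want to read before loading it into BigQuery:

text = blob.download_as_string().decode('utf-8')   # assumption: the .txt files are UTF-8 encoded
print('First 100 characters: {}'.format(text[:100]))   # illustration only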

1 vote

If your goal is to process each and every one (or most) of the uploaded files, @fhenrique's answer is a better approach.

But if your processing is rather sparse compared with the rate at which the files are uploaded (or if your requirements simply don't allow you to switch to the suggested Cloud Storage trigger), then you need to take a closer look at why your expectation of finding the most recently uploaded file at index 0 is not met.

The first reason that comes to mind is your file naming convention. For example, let's assume two such files: data-2019-10-18T14_20_00.000Z-2019-10-18T14_25_00.txt and data-2019-10-18T14_25_00.000Z-2019-10-18T14_30_00.txt. Their lexicographic order would be:

['data-2019-10-18T14_20_00.000Z-2019-10-18T14_25_00.txt',
 'data-2019-10-18T14_25_00.000Z-2019-10-18T14_30_00.txt']

Note that the most recently uploaded file is actually the last one in the list, not the first. So all you'd have to do is replace index 0 with index -1.
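A minimal sketch of that change, reusing the bucket and fileList names from the question:

files = bucket.list_blobs()
fileList = [file.name for file in files if '.' in file.name]
blob = bucket.blob(fileList[-1])   # index -1: the lexicographically last, i.e. most recent, file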

A few other possible things/reasons to consider (try printing fileList to confirm/deny these theories):

  • the file you expect to find at index -1 isn't actually completely uploaded and finalized yet. I'm not sure there's anything you can do in this case - it's simply a matter of managing expectations

  • the list of files returned isn't actually lexicographically sorted (for whatever reason). I see sorting mentioned at Listing Objects, but not in the Storage Client API documentation. Explicitly sorting fileList before picking the file at index -1 should take care of that, if needed.

  • having files in that bucket which don't follow the mentioned naming rule (for whatever reason) - any such file whose name places it after the most recently uploaded file will completely break your algorithm going forward. To protect against such a case, you could use the prefix and maybe the delimiter optional arguments to bucket.list_blobs() to filter the results as needed (see the sketch after the closing paragraph below). From the above-mentioned API doc:

  • prefix (str) – (Optional) prefix used to filter blobs.

  • delimiter (str) – (Optional) Delimiter, used with prefix to emulate hierarchy.

Such filtering can also be useful to limit the number of entries you get in the list based on the current date/time, which might significantly speed up your function's execution, especially if there are many such files uploaded (your naming scheme suggests there can be a whole lot of them).
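As an illustrative sketch (the date-based 'data-' prefix is an assumption derived from the naming convention in the question), combining such a prefix filter with an explicit sort could look like:

from datetime import datetime, timezone

# Assumption: names follow the data-YYYY-MM-DDT... convention from the question,
# so all of today's files share this prefix.
prefix = 'data-{}'.format(datetime.now(timezone.utc).strftime('%Y-%m-%d'))
files = bucket.list_blobs(prefix=prefix)
fileList = sorted(file.name for file in files if '.' in file.name)
if fileList:
    blob = bucket.blob(fileList[-1])   # lexicographically last = most recent under this naming scheme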