0 votes

I have written a Cloud Storage trigger-based Cloud Function. I have 10-15 files landing at 5-second intervals in a Cloud Storage bucket, and the function loads the data into a BigQuery table (truncate and load).

While there are 10 files in the bucket, I want the Cloud Function to process them sequentially, i.e. one file at a time, since all the files target the same table.

Currently the Cloud Function is triggered for multiple files at a time, and the BigQuery operation fails because multiple loads try to access the same table concurrently.

Is there any way to configure this in the Cloud Function?

Thanks in Advance!

2
Do all the files write to the same table? If not, can you differentiate the destination table by a file prefix or a different path in GCS? How many files do you have per day? - guillaume blaquiere
Yes, we have a single table into which all the files are loaded, and it is a truncate-load table. No, I can't create multiple tables, as they would again point to the same final table. We receive at most 30 files a day, but it may vary - Riti
Do the files have a specific order? Or, do you perform a query after your truncate load? - guillaume blaquiere
No, there is no specific order of receiving or loading the files. Yes, we perform our query and transformation activity after the staging load. - Riti
Why do you need to process them sequentially? That complicates cloud architecture, and limits its scalability. To better understand the issue, read this: cloud.google.com/pubsub/docs/ordering - Doug Stevenson

2 Answers

0 votes

You can achieve this by using Pub/Sub and the max instances parameter on the Cloud Function.
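For instance, a possible setup (all resource names here are placeholders) is to publish the Cloud Storage notifications to a Pub/Sub topic and deploy the function with at most one instance, so only one file is processed at a time:

# Send bucket notifications to a Pub/Sub topic instead of triggering the function directly.
gsutil notification create -t gcs-file-events -f json gs://my-bucket

# Deploy the function with a single instance so executions are serialized.
gcloud functions deploy load_file_to_bq \
  --runtime python310 \
  --trigger-topic gcs-file-events \
  --entry-point process_file \
  --max-instances 1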

EDIT

Thanks to your code, I understood what happens. In fact, BigQuery is a declarative system: when you perform a query or a load job, a job is created and it runs in the background.

In Python, you can explicitly wait for the end of the job, but with pandas I didn't find how!

I just found a Google Cloud page that explains how to migrate from pandas to the BigQuery client library. As you can see, there is a line at the end

# Wait for the load job to complete.
job.result()

that waits for the end of the job.

You did it well in the _insert_into_bigquery_dwh function, but that's not the case in the staging _insert_into_bigquery_staging one. This can lead to two issues:

  • The dwh function works on old data, because the staging load isn't finished yet when you trigger that job
  • If the staging load takes, let's say, 10 seconds and runs in the "background" (you don't explicitly wait for its end in your code) while the dwh step takes 1 second, the next file is processed as soon as the dwh function ends, even though the staging load is still running in the background. And that leads to your issue (see the sketch below).
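Here is a minimal sketch of what the staging load could look like with an explicit wait, assuming it loads a pandas DataFrame with the BigQuery client library (the project, dataset, and table names are placeholders):

from google.cloud import bigquery

def _insert_into_bigquery_staging(dataframe):
    # Placeholder destination; replace with your real staging table.
    client = bigquery.Client()
    table_id = "my-project.my_dataset.staging_table"

    job_config = bigquery.LoadJobConfig(
        write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
    )

    # The load job runs asynchronously on BigQuery's side...
    job = client.load_table_from_dataframe(dataframe, table_id, job_config=job_config)

    # ...so block here until it finishes, so that the dwh step only ever
    # reads fully loaded staging data.
    job.result()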
0 votes

The architecture you describe isn't the same as the one from the documentation you linked. Note that in the flow diagram and the code samples the storage event triggers a Cloud Function which streams the data directly into the destination table. Since BigQuery allows multiple concurrent streaming inserts, several function executions can run at the same time without problems. In your use case the intermediate table, loaded with write-truncate for data cleaning, makes a big difference: each execution needs the previous one to finish, which requires a sequential processing approach.

I would like to point out that Pub/Sub doesn't let you configure the rate at which messages are delivered: if 10 messages arrive at the topic, they will all be sent to the subscriber, even if they are processed one at a time. Limiting the function to one instance may therefore lead to a backlog and could increase latency as well. That said, since the expected workload is 15-30 files a day, this may not be a big concern.

If you'd like to have parallel executions, you may try creating a new table for each message and setting a short expiration time on it via the table.expires setter, so that multiple executions don't conflict with each other. Here is the related library reference. Otherwise the great answer from Guillaume would completely get the job done.
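As a minimal sketch of that idea, assuming the incoming file name can be turned into a unique table suffix (the project, dataset, schema, and naming convention are placeholders):

from datetime import datetime, timedelta, timezone
from google.cloud import bigquery

def create_staging_table_for_file(client: bigquery.Client, file_name: str) -> bigquery.Table:
    # One staging table per incoming file, so concurrent executions don't clash.
    table_id = "my-project.my_dataset.staging_" + file_name.replace(".", "_")

    # Placeholder schema; use the real schema of your files.
    schema = [
        bigquery.SchemaField("name", "STRING"),
        bigquery.SchemaField("value", "INTEGER"),
    ]
    table = bigquery.Table(table_id, schema=schema)

    # Short expiration so the per-file tables clean themselves up.
    table.expires = datetime.now(timezone.utc) + timedelta(hours=1)
    return client.create_table(table)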