2
votes

I'm experimenting with using Cloud Functions as an async background worker triggered by Pub/Sub and performing somewhat longer work (on the order of minutes). The complete code is here: https://github.com/zdenulo/cloud-functions-pubsub

My prototype inserts data into BigQuery and then waits for a few minutes (to mimic a longer task). I publish 100 messages to a Pub/Sub topic at 1-second intervals.

The documentation emphasizes that Pub/Sub can deliver the same message more than once, but I was surprised that 10 to 40 out of 100 messages were duplicated. This happened with function execution times of 5, 6, and 7 minutes; with a 4-minute execution time, I didn't notice any duplicates.
I've run multiple tests for the same time intervals. The time difference between receiving the first and second delivery of a message ranges from ~30 to ~600 seconds.

The documentation at https://cloud.google.com/pubsub/docs/troubleshooting mentions: "Cloud Pub/Sub can send duplicate messages. For instance, when you do not acknowledge a message before its acknowledgement deadline has expired, Cloud Pub/Sub resends the message." For Cloud Functions subscriptions, the acknowledgement deadline is 600 seconds (10 minutes), so based on my understanding that shouldn't be the reason.

Maybe my test case is specific, or maybe there is something else going on.
I would be grateful for advice on whether this is normal, how to handle such a situation, and how to prevent duplicates (excluding Dataflow).

1
Cloud Functions can duplicate events as well, for any kind of trigger, so your functions really should expect to receive duplicates by being idempotent. – Doug Stevenson
All triggers except the HTTP trigger. As explained in the docs, HTTP functions are invoked at most once, while background functions (Pub/Sub or any other trigger) are invoked at least once. – Jofre
Thanks for the comments and clarification. – zdenulo
There is a good entry here – cloud.google.com/pubsub/docs/faq – titled "How do I detect duplicate messages?". I think a common technique is to use a cheap global data store (Redis/memcached) and save the message_id of each message that is processed. Before you process a new message, check that you haven't seen it before in the cache. – Kolban
Thanks, @Kolban. The Redis/memcached approach should work, but for infrequent, small-scale usage it could be a bit of an overkill. It always depends on the use case, I guess. I'm just surprised that I'm seeing such a high percentage of duplicates. – zdenulo
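The deduplication technique from the comments could be sketched roughly as follows. This is only an illustration: an in-memory set stands in for the Redis/memcached store, and the function and variable names (`handle_message`, `seen_ids`, `processed`) are made up for the example, not part of any GCP API. In a real deployment the seen-set would live in a shared cache with a TTL at least as long as the redelivery window.

```python
# Idempotent message handling: skip a message whose message_id has
# already been processed. A plain set stands in for Redis/memcached
# here, only to keep the sketch self-contained.

processed = []  # stands in for the real side effect (e.g. a BigQuery insert)

def handle_message(message_id, data, seen):
    """Process a Pub/Sub message at most once per message_id."""
    if message_id in seen:
        return False          # duplicate delivery: acknowledge and do nothing
    seen.add(message_id)      # record the id so redeliveries are skipped
    processed.append(data)    # the actual work goes here
    return True

seen_ids = set()
handle_message("m-1", {"row": 1}, seen_ids)   # first delivery: processed
handle_message("m-1", {"row": 1}, seen_ids)   # redelivery: skipped
```

Note that with a separate "check" and "record" step there is still a small race window between concurrent deliveries; with Redis this is usually closed by an atomic SET with NX (set-if-not-exists) plus an expiry.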

1 Answer

0
votes

There is an issue impacting Cloud Functions deployed before January 2019 that causes an increased rate of duplicate triggers for functions that take more than 5 minutes to run. Please try deleting and re-deploying your function to resolve the issue.