Simple task queue using Google Cloud Platform : issue with Google PubSub

Question

My task : I cannot speak openly about what the specifics of my task are, but here is an analogy : every two hours, I get a variable number of spoken audio files. Sometimes only 10, sometimes 800 or more. Let's say I have a costly python task to perform on these files, for example Automatic Speech Recognition. I have a Google Intance managed group that can deploy any number of VMs for executing this task.

The issue : right now, I'm using Google PubSub. Every two hours, a topic is filled with audio ids. Instances of the managed group can be deployed depending on the size of queue. The problem is, only one worker get all the messages from the PubSub subscription, while the others are not receiving any, perhaps because the queue is not that long (maximum ~1000 messages). This issue is reported in a few cases in the python Google Cloud github, and it is not clear if it is the intended purpose of PubSub, or just a bug.

How could I implement the equivalent of a simple serverless task queue in Python and Google Cloud, and can spawn instances based on a given metric, for example the size of the queue ? Is this the intended purpose of PubSub ?

Thanks in advance.

Marcel Marcel · Accepted Answer · 2018-05-23T14:15:07

In App Engine you can create push queues and set rate/concurrency limits and Google will handle the rest for you. App Engine will scale as needed (e.g. increase Python instances).

If you're outside of App Engine (e.g. GKE), the pubsub Python client library may be pulling many messages at once. We had a hard time controlling this (for google-cloud-pubsub==0.34.0) as well so we ended up writing a small adjustment on top of google-cloud-pubsub calling SubscriberClient.pull with max_messages set). The server side pubsub API does adhere to max_messages.

Simple task queue using Google Cloud Platform : issue with Google PubSub

1 Answers