1
votes

Our application runs on Google Kubernetes Engine and pulls messages from a Google Cloud Pub/Sub subscription. One pod runs while the system is idle, and horizontal pod autoscaling is configured to scale up to 10 pods based on CPU usage. The subscription is mostly empty; when a batch job kicks in, it writes into the Pub/Sub topic. The autoscaling works well: it scales up to 10 pods almost immediately (within 30 seconds) once there are outstanding messages in the subscription. The issue is that only a few pods pull messages from the subscription, while the rest sit idle even though there are still messages in the subscription.

Pub/Sub Client settings are:

MaxExtension: 600
MaxOutstandingMessages: 100 (also tried with 25)
Synchronous: true (also tried with false)

Google Cloud Pub/Sub Subscription Settings:

Pull-based
Ack Deadline is 600 seconds

Once the batch job kicks in, it writes 20k messages into the Pub/Sub topic. The application can process 2 messages/sec on average.

The application is written in Go, and we're using version v0.44.1 of the cloud.google.com/go package.

Do you know why the pods are sitting and not pulling messages even though there's a backlog in the Cloud Pub/Sub subscription?

1
How many pods are sitting idle, on average? - guillaume blaquiere
Most of the time 8 out of 10 are sitting idle. In the worst case, only one pod is running and the rest are waiting. When I kill the one that is consuming messages, the others start to consume messages. - mdtp
Is it possible that all your clients have the same subscription ID? Please add a minimal reproducible example to your question by editing it. - Markus W Mahlberg
@mdtp did you find the answer? @Markus I have created one subscription on GCP. My service is subscribing to that subscription and calling Receive. - code muncher

1 Answer

0
votes

Can you try to set this:

    sub.ReceiveSettings.NumGoroutines = 10 * runtime.NumCPU()
    sub.ReceiveSettings.MaxOutstandingMessages = -1
    sub.ReceiveSettings.MaxOutstandingBytes = -1

Maybe removing some of these limits will help?

Let me know.