
We are using Google PubSub in a 'spiky' fashion: we publish millions of small messages (< 10 KB) in a short time (~10 mins), then spin up 2k GKE pods with 10 worker threads each that use synchronous pull and acknowledge PubSub service calls to work through the associated subscription (configured with a 10 minute acknowledgement deadline). The Stackdriver graph for the subscription backlog shows a spike to 10M messages and then a downward slope to 0 in around 30 minutes (see below).
We noticed that message re-delivery increased from below 1% to beyond 10% for certain hours as these backlogs grew from 1M to 10M messages.

Coming from the GAE Pull Task Queue world, we assumed that a worker would "lease" a message by pulling it from the PubSub subscription, and that, starting at the time of the pull, the worker would have 10 minutes to acknowledge the message. What appears to be happening, however, after adding logging (see below for an example of a re-delivered message), is that it is not the time from pull to ack that matters, but the time from publishing the message to acknowledgement.

Is this the right understanding of PubSub acknowledgement deadline, and subsequent redelivery behavior?

If so, should we be making sure the subscription's message backlog only grows to a size that the worker threads can process and acknowledge within the subscription's configured acknowledgement deadline, in order to get re-delivery rates below 0.1% on average? We could probably have the publisher apply some sort of back-pressure based on the subscription backlog size, although the GAE Pull Task Queue leasing behavior seems more intuitive.
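For illustration, a publisher-side back-pressure loop could look roughly like the sketch below. The `publish` and `get_backlog_size` callables are hypothetical stand-ins (the real backlog size would come from the Monitoring API, which is not shown here):

```python
import time

def publish_with_backpressure(messages, publish, get_backlog_size,
                              max_backlog=1_000_000, poll_interval=5.0):
    """Publish messages, pausing whenever the subscription backlog
    exceeds max_backlog. `publish` and `get_backlog_size` are injected
    stand-ins so this sketch stays independent of the Pub/Sub client."""
    for message in messages:
        while get_backlog_size() > max_backlog:
            # Wait for the workers to drain the backlog before publishing more.
            time.sleep(poll_interval)
        publish(message)
```

This caps the backlog at roughly the size the workers can ack within the deadline, at the cost of a slower publish phase.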

Also, the wording in https://cloud.google.com/pubsub/docs/subscriber#push-subscription, under "Pull subscription" ("The subscribing application explicitly calls the pull method, which requests messages for delivery"), seems to imply that the acknowledgement deadline starts only after the client's pull call returns a given message?

Note: we use the Python PubSub API (google-cloud-pubsub), although not the default streaming behavior, as this caused the "message hoarding" described in the PubSub docs given the large number of small messages we publish. Instead we call subscriber_client.pull and acknowledge (which seem to be thin wrappers around the PubSub service API calls).

PullMessage.ack: 1303776574755856 delay from lease: 0:00:35.032463 (35.032463 seconds), publish: 0:10:02.806571 (602.806571 seconds)

[Stackdriver graph: subscription backlog spikes to ~10M messages, then slopes down to 0 over ~30 minutes]


1 Answer


The ack deadline is for the time between Cloud Pub/Sub sending a message to a subscriber and receiving an ack call for that message. (It is not the time between publishing the message and acking it.) With raw synchronous pull and acknowledge calls, subscribers are responsible for managing the lease. This means that without explicit calls to modifyAckDeadline, the message must be acked by the ack deadline (which defaults to 10 seconds, not 10 minutes).
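With raw synchronous pull, keeping a slow message alive therefore means calling modifyAckDeadline yourself before each deadline expires. A minimal sketch of such a lease-keeper, with the service call injected as a plain callable:

```python
import threading

def hold_lease(ack_id, modify_ack_deadline, done,
               deadline_secs=10, safety_margin=2):
    """Re-extend the ack deadline for ack_id every
    (deadline_secs - safety_margin) seconds until `done` is set.
    `modify_ack_deadline` is a stand-in for the modifyAckDeadline call."""
    interval = deadline_secs - safety_margin
    while not done.wait(interval):
        modify_ack_deadline(ack_id, deadline_secs)
```

The worker would set `done` right before acking, so the lease is held exactly as long as processing takes.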

If you use one of the Cloud Pub/Sub client libraries, received messages will have their leases extended automatically. The behavior for how this lease management works depends on the library. In the Python client library, for example, leases are extended based on previous messages' time-to-ack.
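Conceptually, that adaptive extension is similar to the sketch below: pick the next lease duration from a high percentile of observed time-to-ack, clamped to the service's allowed range (10 to 600 seconds). This is a loose illustration, not the library's actual implementation:

```python
def next_ack_deadline(ack_latencies, percentile=99, min_secs=10, max_secs=600):
    """Choose a lease-extension deadline from observed time-to-ack
    values (in seconds), clamped to Pub/Sub's allowed 10-600s range."""
    if not ack_latencies:
        return min_secs
    ordered = sorted(ack_latencies)
    idx = min(len(ordered) - 1, int(len(ordered) * percentile / 100))
    return max(min_secs, min(max_secs, ordered[idx]))
```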

There are many reasons for message redelivery. It's possible that as the backlog increases, load on your workers increases, which in turn increases queuing time at your workers and the time taken to ack messages. You can try increasing your worker count to see if this improves your redelivery rate for large backlogs. Also, the longer it takes for messages to be acked, the more likely they are to be redelivered: the server may consider them expired and deliver them once again.

There is one thing you could do on the publish side to reduce message redeliveries: reduce your publish batch size. Internally, ack state is stored per batch, so if even one message in a batch exceeds the ackDeadline, they may all be redelivered.
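The per-batch ack-state effect can be illustrated with a small model (hypothetical data structures, purely for illustration): a batch counts as redelivered if any one of its messages is acked after the deadline.

```python
def redelivered_batches(batches, ack_times, deadline):
    """Model per-batch ack state: a batch is redelivered if any message
    in it is acked after the deadline. `batches` maps batch id to
    message ids; `ack_times` maps message id to seconds-to-ack."""
    return {
        batch_id
        for batch_id, message_ids in batches.items()
        if any(ack_times[m] > deadline for m in message_ids)
    }
```

Smaller batches mean one slow message drags fewer neighbors back into redelivery with it.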

Message redelivery may happen for many other reasons, but scaling your workers and reducing your publish batch size are good places to start.