We use Google Pub/Sub in a 'spiky' fashion: we publish millions of small messages (< 10 KB) over a short window (~10 minutes), then spin up 2,000 GKE pods with 10 worker threads each that use synchronous pull and acknowledge calls against the Pub/Sub service to work through the associated subscription (which has a 10-minute acknowledgement deadline).
The Stackdriver graph for the subscription backlog shows a spike to 10M messages followed by a downward slope to zero over roughly 30 minutes (see below).
We noticed that the re-delivery rate increased from below 1% to beyond 10% in certain hours as these backlogs grew from 1M to 10M messages.
Coming from the GAE Pull Task Queue world, we assumed a worker would "lease" a message by pulling it from the Pub/Sub subscription, and that, starting at the time of the pull, the worker would have 10 minutes to acknowledge the message. After adding logging (see below for an example of a re-published message), however, it appears that what matters is not the time from pull to acknowledgement, but the time from publishing the message to acknowledgement.
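The two readings can be compared with plain datetime arithmetic (no Pub/Sub calls; the 602.8 s and 35.0 s figures are taken from the sample log line below, and the publish timestamp is an arbitrary placeholder):

```python
from datetime import datetime, timedelta

ACK_DEADLINE = timedelta(minutes=10)  # the subscription's acknowledgement deadline

def exceeded_deadline(start: datetime, ack_time: datetime) -> bool:
    """True if the ack arrived more than ACK_DEADLINE after *start*."""
    return ack_time - start > ACK_DEADLINE

# Figures from the sample log line: pull-to-ack was only ~35 s,
# but publish-to-ack was ~602.8 s -- just past the 600 s deadline.
publish = datetime(2018, 1, 1, 12, 0, 0)  # placeholder publish time
pull = publish + ACK_DEADLINE - timedelta(seconds=35.032463)
ack = publish + timedelta(seconds=602.806571)

print(exceeded_deadline(pull, ack))     # lease-based reading: within deadline
print(exceeded_deadline(publish, ack))  # publish-based reading: deadline missed
```

Under the lease-based reading this message was acked comfortably in time; only under the publish-based reading does it miss the deadline, which would explain the observed re-delivery.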
Is this the right understanding of PubSub acknowledgement deadline, and subsequent redelivery behavior?
If so, should we ensure that the subscription's backlog only grows to a size that the worker threads can process and acknowledge within the subscription's configured acknowledgement deadline, in order to keep re-delivery rates below 0.1% on average? We could probably have the publisher apply some form of back-pressure based on the subscription backlog size, although the GAE Pull Task Queue leasing behavior seems more intuitive.
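A back-pressure rule along those lines could look like the sketch below. All the throughput numbers are illustrative assumptions, not measurements; the idea is just to cap the backlog at what the fleet can drain within the ack deadline, with a safety margin:

```python
# Hypothetical back-pressure check: pause publishing once the backlog
# exceeds what the worker fleet can ack within the ack deadline.
def max_safe_backlog(workers: int, msgs_per_worker_per_sec: float,
                     ack_deadline_sec: int, safety: float = 0.8) -> int:
    """Largest backlog the fleet can drain inside the ack deadline."""
    return int(workers * msgs_per_worker_per_sec * ack_deadline_sec * safety)

def should_pause_publishing(backlog: int, **fleet) -> bool:
    return backlog >= max_safe_backlog(**fleet)

# 2,000 pods * 10 threads = 20,000 workers; assume ~0.5 msg/s per worker.
fleet = dict(workers=20_000, msgs_per_worker_per_sec=0.5, ack_deadline_sec=600)
print(max_safe_backlog(**fleet))                      # 4,800,000 messages
print(should_pause_publishing(10_000_000, **fleet))   # a 10M backlog would throttle
```

The backlog figure itself could come from the Stackdriver `num_undelivered_messages` metric for the subscription.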
Also, the wording in https://cloud.google.com/pubsub/docs/subscriber#push-subscription, under "Pull subscription" ("The subscribing application explicitly calls the pull method, which requests messages for delivery"), seems to imply that the acknowledgement timeout starts after the client's pull call returns a given message?
Note: we use the Python Pub/Sub API (google-cloud-pubsub), but not its default streaming-pull behavior, as that caused the "message hoarding" described in the Pub/Sub docs given the large number of small messages we publish. Instead we call subscriber_client.pull and acknowledge directly (these appear to be thin wrappers around the Pub/Sub service API calls).
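For reference, our worker loop is essentially the following sketch (request-dict call style from google-cloud-pubsub 2.x; the project/subscription names and batch sizes are placeholders, and running it against a real subscription requires GCP credentials):

```python
def chunked(items, size):
    """Yield fixed-size chunks so each acknowledge() request stays small."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

def drain(project: str, subscription: str, batch: int = 1000) -> None:
    """Synchronously pull and ack messages until the subscription is empty."""
    # Imported lazily so the chunking helper above is usable without the
    # client library installed.
    from google.cloud import pubsub_v1

    subscriber = pubsub_v1.SubscriberClient()
    sub_path = subscriber.subscription_path(project, subscription)
    while True:
        response = subscriber.pull(
            request={"subscription": sub_path, "max_messages": batch})
        if not response.received_messages:
            break
        # ... process each received_message.message here ...
        ack_ids = [m.ack_id for m in response.received_messages]
        for ids in chunked(ack_ids, 500):
            subscriber.acknowledge(
                request={"subscription": sub_path, "ack_ids": ids})
```

Each thread runs `drain(...)` independently; acks are batched so a slow acknowledge call does not hold up processing of the next pull.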
PullMessage.ack: 1303776574755856 delay from lease: 0:00:35.032463 (35.032463 seconds), publish: 0:10:02.806571 (602.806571 seconds)
