1
votes

In the docs about GCP Storage and Pub/Sub notification I find this sentence that is not really clear:

Cloud Pub/Sub also offers at-least-once delivery to the recipient [that's pretty clear], which means that you could receive multiple messages, with multiple IDs, that represent the same Cloud Storage event [why?]

Can anyone give a better explanation of this behavior?

Thanks!


3 Answers

2
votes

At-least-once delivery means the sender keeps a message until the recipient confirms that it was received, and resends it if no confirmation arrives within some timeout period. Because of network latency, packet loss, etc., the recipient may send a confirmation that the sender does not receive before the timeout expires. The sender then sends the message again, even though the recipient already has it.

This is a common problem in network communications and distributed systems, and there are different types of messaging to address this issue.
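To make the retry behavior concrete, here is a minimal simulation in plain Python. It is only a sketch: `deliver_at_least_once` and its `ack_arrives` input are hypothetical names made up for illustration, and it assumes the message itself always reaches the recipient, so only the confirmation can get lost or arrive late.

```python
def deliver_at_least_once(ack_arrives, max_attempts=10):
    """Resend until an ack reaches the sender; return how many copies the recipient got.

    ack_arrives: one boolean per attempt, True if the recipient's ack made it
    back to the sender before the timeout (hypothetical input for this simulation).
    """
    received_copies = 0
    for attempt, ack_ok in enumerate(ack_arrives, start=1):
        if attempt > max_attempts:
            break
        received_copies += 1  # the recipient gets the message on every attempt
        if ack_ok:            # the sender saw the ack in time, so it stops retrying
            break
    return received_copies


# The first two acks are lost or late, the third one arrives.
copies = deliver_at_least_once([False, False, True])
print(copies)  # 3
```

The recipient ends up with three copies of one message, which is exactly the "at least once" part of the guarantee.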

4
votes

Google Cloud Storage uses at-least-once delivery to deliver your notifications to Cloud Pub/Sub. In other words, GCS will publish at least one message into Cloud Pub/Sub for each event that occurs.

Next, a Cloud Pub/Sub subscription will deliver the message to you, the end user, at least once.

So, say that in some rare case, GCS publishes two messages about the same event to Cloud Pub/Sub. Now that one GCS event has two Pub/Sub message IDs. Next, to make it even more unlikely, Pub/Sub delivers each of those messages twice. Now you have received 4 messages, with 2 message IDs, about the same single GCS event.

The important takeaway of the warning is that you should not attempt to dedupe GCS events by Pub/Sub message ID.
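The scenario above can be reproduced with a small simulation that does not touch the real Pub/Sub client. The attribute names (`eventType`, `bucketId`, `objectId`, `objectGeneration`) are the ones GCS attaches to its notifications; the `publish` helper, the sample values, and the choice of those four attributes as an event key are assumptions made for this sketch.

```python
import itertools

_next_id = itertools.count(1)


def publish(event):
    """Model a publish: every published message gets a fresh message ID."""
    return {"message_id": str(next(_next_id)), "attributes": dict(event)}


# One GCS event (sample values are made up).
event = {
    "eventType": "OBJECT_FINALIZE",
    "bucketId": "example-bucket",
    "objectId": "report.csv",
    "objectGeneration": "1579287380533984",
}

# GCS publishes the same event twice (rare), so it gets two message IDs.
published = [publish(event), publish(event)]

# The subscription then delivers each of those messages twice (also rare).
received = [msg for msg in published for _ in range(2)]


def event_key(msg):
    """Hypothetical key identifying the underlying GCS event."""
    a = msg["attributes"]
    return (a["bucketId"], a["objectId"], a["objectGeneration"], a["eventType"])


distinct_message_ids = {msg["message_id"] for msg in received}
distinct_events = {event_key(msg) for msg in received}

print(len(received), len(distinct_message_ids), len(distinct_events))  # 4 2 1
```

Deduping by message ID still leaves two copies of the event; only a key derived from the event itself collapses them to one.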

0
votes

To answer the question of 'why':

'At least once' delivery just means messages will be retried via some retry mechanism until successfully delivered (i.e. acknowledged). So if there's a failure or timeout then there's a retry.

By its very nature (a retry mechanism), this means you might occasionally get duplicates, i.e. more-than-once delivery. It's the same whether it's PubSub or GCS notifications delivering the message.

In the scenario you quote, you have:

  1. The Publisher (GCS notification): may send duplicates of GCS events to the PubSub topic
  2. The PubSub topic messages: may contain duplicates from the publisher
    • no deduplication as messages come in
    • all messages are assigned a unique PubSub message_id, even if they are duplicates of the same GCS event notification
  3. PubSub topic Subscription(s): may also send duplicates of messages to subscribers

With PubSub

Once a message is sent to a subscriber, the subscriber must either acknowledge or drop the message. A message is considered outstanding once it has been sent out for delivery and before a subscriber acknowledges it.

A subscriber has a configurable, limited amount of time, or ackDeadline, to acknowledge the message. Once the deadline has passed, an outstanding message becomes unacknowledged.

Cloud Pub/Sub will repeatedly attempt to deliver any message that has not been acknowledged or that is not outstanding.

Source: https://cloud.google.com/pubsub/docs/subscriber#at-least-once-delivery
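As a rough illustration of the ack deadline described in that quote, here is a toy model with a fake clock. `FakeSubscription` is a made-up class for this sketch, not the real Pub/Sub client API, and it ignores everything except the deadline logic.

```python
class FakeSubscription:
    """Toy model of a pull subscription with an ack deadline (not the real client API)."""

    def __init__(self, ack_deadline):
        self.ack_deadline = ack_deadline
        self.pending = {}  # message_id -> time it was last sent, or None if never sent

    def publish(self, message_id):
        self.pending[message_id] = None

    def pull(self, now):
        """Return IDs that were never sent or whose ack deadline has expired."""
        due = [
            mid
            for mid, sent_at in self.pending.items()
            if sent_at is None or now - sent_at >= self.ack_deadline
        ]
        for mid in due:
            self.pending[mid] = now  # the message is outstanding again
        return due

    def ack(self, message_id):
        self.pending.pop(message_id, None)


sub = FakeSubscription(ack_deadline=10)
sub.publish("m1")

first = sub.pull(now=0)    # delivered, now outstanding
second = sub.pull(now=5)   # still within the ack deadline, nothing to send
third = sub.pull(now=12)   # deadline passed without an ack, so it is redelivered
sub.ack("m1")
fourth = sub.pull(now=30)  # acknowledged messages are not sent again

print(first, second, third, fourth)  # ['m1'] [] ['m1'] []
```

If the subscriber had actually finished its work at time 11 but acked too late, the second delivery at time 12 would be a duplicate from its point of view.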

With Google Cloud Storage

GCS needs to do something similar internally to 'publish' the notification event from GCS to PubSub, so the reason is essentially the same.


Why this matters

  • You need to expect occasional duplicates originating from GCS notifications as well as the PubSub subscriptions
  • The PubSub message id can be used to detect duplicates from the pubsub topic -> subscriber
  • You have to figure out your own idempotent id/token to handle duplicates from the 'publisher' (the GCS notification event)

If you need to de-duplicate or achieve exactly once processing, you can then build your own solution utilising the idempotent ids/tokens or see if Cloud Dataflow can accommodate your needs.

You can achieve exactly once processing of Cloud Pub/Sub message streams using Cloud Dataflow PubsubIO. PubsubIO de-duplicates messages on custom message identifiers or those assigned by Cloud Pub/Sub. Source: https://cloud.google.com/pubsub/docs/faq#duplicates
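For the build-your-own route, a minimal sketch of an idempotent handler could look like the following. The `IdempotentHandler` class, the in-memory `seen` set, and the choice of key are all assumptions for illustration; a real deployment would need a durable, shared store and would have to decide what counts as "the same event" for its own use case.

```python
class IdempotentHandler:
    """Run the side effect once per GCS event, however many copies arrive."""

    def __init__(self, process):
        self.process = process
        self.seen = set()  # in-memory only; a real system would need a durable store

    def handle(self, message):
        a = message["attributes"]
        # Hypothetical idempotency key built from GCS notification attributes.
        key = (a["bucketId"], a["objectId"], a["objectGeneration"], a["eventType"])
        if key in self.seen:
            return False  # duplicate: skip the side effect (the caller should still ack)
        self.process(message)
        self.seen.add(key)
        return True


processed = []
handler = IdempotentHandler(processed.append)

attrs = {
    "eventType": "OBJECT_FINALIZE",
    "bucketId": "example-bucket",
    "objectId": "report.csv",
    "objectGeneration": "1579287380533984",
}
incoming = [
    {"message_id": "1", "attributes": attrs},
    {"message_id": "1", "attributes": attrs},  # redelivery by the subscription
    {"message_id": "2", "attributes": attrs},  # duplicate publish by GCS
]

results = [handler.handle(m) for m in incoming]
print(results, len(processed))  # [True, False, False] 1
```

Because the key comes from the event rather than from the PubSub message ID, it covers duplicates from both the publisher and the subscription.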

If you are interested in a more fundamental exploration of the why, see:

There is No Now - Problems with simultaneity in distributed systems