To answer the question of 'why'
'At least once' delivery just means messages will be retried via some retry mechanism until successfully delivered (i.e. acknowledged). So if there's a failure or timeout then there's a retry.
By its very nature (a retry mechanism), this means you may occasionally get duplicates, i.e. more-than-once delivery. This holds whether it is PubSub or GCS notifications delivering the message.
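To make the "retry implies duplicates" point concrete, here is a minimal Python sketch. It is a simulation only, not how PubSub or GCS are implemented: `FlakyReceiver` and `send_at_least_once` are hypothetical names, and the lost acknowledgement is faked with a counter.

```python
class FlakyReceiver:
    """Processes every delivery, but 'loses' the ack for the first N of them."""

    def __init__(self, lost_acks):
        self.lost_acks = lost_acks
        self.received = []

    def deliver(self, message):
        # The message is processed on every delivery attempt...
        self.received.append(message)
        # ...but the sender only learns about it if the ack gets through.
        if self.lost_acks > 0:
            self.lost_acks -= 1
            return False  # ack lost (simulated failure or timeout)
        return True


def send_at_least_once(receiver, message, max_attempts=5):
    """Retry until the receiver acknowledges; return the number of attempts."""
    for attempt in range(1, max_attempts + 1):
        if receiver.deliver(message):
            return attempt
    raise RuntimeError("message was never acknowledged")


receiver = FlakyReceiver(lost_acks=1)
attempts = send_at_least_once(receiver, "object-created")
print(attempts, receiver.received)
# prints: 2 ['object-created', 'object-created']
```

The sender cannot tell "never processed" apart from "processed, but the ack was lost", so it has to retry in both cases, and the receiver sees the message twice.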
In the scenario you quote, you have:
- The publisher (GCS notification) -- may send duplicates of GCS events to the PubSub topic
- The PubSub topic messages -- may contain duplicates from the publisher
  - no deduplication as messages come in
  - all messages are assigned a unique PubSub message_id, even if they are duplicates of the same GCS event notification
- PubSub topic subscription(s) -- may also send duplicates of messages to subscribers
With PubSub
Once a message is sent to a subscriber, the subscriber must either acknowledge or drop the message. A message is considered outstanding once it has been sent out for delivery and before a subscriber acknowledges it.
A subscriber has a configurable, limited amount of time, or ackDeadline, to acknowledge the message. Once the deadline has passed, an outstanding message becomes unacknowledged.
Cloud Pub/Sub will repeatedly attempt to deliver any message that has not been acknowledged or that is not outstanding.
Source: https://cloud.google.com/pubsub/docs/subscriber#at-least-once-delivery
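The behaviour described in the quote above can be sketched as a toy model in Python. This is an illustration under assumptions, not the PubSub client API: the `Subscription` class and its `publish`, `pull` and `ack` methods are hypothetical, and time is passed in explicitly as `now` (in seconds) to keep the example deterministic.

```python
class Subscription:
    """Toy model: a message is redelivered once its ack deadline has passed."""

    def __init__(self, ack_deadline_seconds):
        self.ack_deadline = ack_deadline_seconds
        # message_id -> time the current delivery expires (None = not yet sent)
        self.pending = {}

    def publish(self, message_id):
        self.pending[message_id] = None

    def pull(self, now):
        """Deliver every message that is neither acknowledged nor outstanding."""
        due = [
            message_id
            for message_id, expiry in self.pending.items()
            if expiry is None or now >= expiry
        ]
        for message_id in due:
            # The message is outstanding until this delivery's deadline passes.
            self.pending[message_id] = now + self.ack_deadline
        return due

    def ack(self, message_id):
        # An acknowledged message is never delivered again.
        self.pending.pop(message_id, None)


sub = Subscription(ack_deadline_seconds=10)
sub.publish("m1")

first = sub.pull(now=0)    # delivered for the first time
during = sub.pull(now=5)   # still outstanding, so not redelivered
again = sub.pull(now=10)   # deadline passed without an ack: redelivered
sub.ack("m1")
after = sub.pull(now=30)   # acknowledged, so nothing left to deliver

print(first, during, again, after)
# prints: ['m1'] [] ['m1'] []
```

A subscriber that is merely slow (rather than failed) gets the same message a second time, which is one way duplicates arise on the subscription side.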
With Google Cloud Storage
GCS needs to do something similar internally to 'publish' the notification event to PubSub, so the reason is essentially the same.
Why this matters
- You need to expect occasional duplicates originating from the GCS notifications as well as from the PubSub subscriptions
- The PubSub message id can be used to detect duplicates on the PubSub topic -> subscriber leg
- You have to work out your own idempotency id/token to handle duplicates from the 'publisher' (the GCS notification event)
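The two kinds of duplicate in the list above can be handled together, as in the following minimal Python sketch. Several things here are assumptions rather than a recommended design: the `Deduplicator` class and `idempotency_key` function are hypothetical, the key is built from the `eventType`, `bucketId`, `objectId` and `objectGeneration` attributes that GCS attaches to its notifications, and the in-memory sets stand in for whatever durable store a real system would need.

```python
def idempotency_key(attributes):
    """Build a key identifying the GCS event itself (assumed choice of fields)."""
    return (
        attributes["eventType"],
        attributes["bucketId"],
        attributes["objectId"],
        attributes["objectGeneration"],
    )


class Deduplicator:
    """In-memory duplicate filter; a real system would need a durable store."""

    def __init__(self):
        self.seen_message_ids = set()  # catches topic -> subscriber redeliveries
        self.seen_event_keys = set()   # catches duplicates from the publisher

    def should_process(self, message_id, attributes):
        key = idempotency_key(attributes)
        if message_id in self.seen_message_ids or key in self.seen_event_keys:
            return False
        self.seen_message_ids.add(message_id)
        self.seen_event_keys.add(key)
        return True


upload = {
    "eventType": "OBJECT_FINALIZE",
    "bucketId": "my-bucket",
    "objectId": "report.csv",
    "objectGeneration": "1",
}
other_upload = dict(upload, objectId="other.csv")

dedup = Deduplicator()
results = [
    dedup.should_process("100", upload),        # first delivery
    dedup.should_process("100", upload),        # same message_id redelivered
    dedup.should_process("101", upload),        # new message_id, same GCS event
    dedup.should_process("102", other_upload),  # genuinely different event
]
print(results)
# prints: [True, False, False, True]
```

Note that with this choice of key, repeated events of the same type for the same object generation (for example several metadata updates) would be collapsed into one, so the fields would need adjusting if those must be told apart.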
If you need to de-duplicate or achieve exactly once processing, you can then build your own solution utilising the idempotent ids/tokens or see if Cloud Dataflow can accommodate your needs.
You can achieve exactly once processing of Cloud Pub/Sub message streams using Cloud Dataflow PubsubIO. PubsubIO de-duplicates messages on custom message identifiers or those assigned by Cloud Pub/Sub.
Source: https://cloud.google.com/pubsub/docs/faq#duplicates
If you are interested in a more fundamental exploration of the 'why', see:
There is No Now - Problems with simultaneity in distributed systems