2
votes

I have an application writing data to Google Cloud Pub/Sub, and per the Pub/Sub documentation, duplicates can occasionally occur because of its retry mechanism. Message ordering is also not guaranteed, so messages can arrive out of order.

Also per the documentation, it is possible to use Google Cloud Dataflow to deduplicate these messages.

I want to make those messages available in a message queue (meaning Cloud Pub/Sub) for services to consume. Cloud Dataflow does seem to have a PubsubIO writer, but wouldn't writing back to Pub/Sub bring you right back to exactly the same duplication problem? Wouldn't the same apply to ordering? How can I stream messages in order using Pub/Sub (or any other system, for that matter)?

Is it possible to use Cloud Dataflow to read from one Pub/Sub topic and write to another Pub/Sub topic with a guarantee of no duplicates? If not, how else would you do this in a way that supports streaming for a relatively small amount of data?

Also, I am very new to Apache Beam / Cloud Dataflow. What would such a simple use case look like? I suppose I can deduplicate using the ID generated by Pub/Sub itself, since I let the Pub/Sub client library do its internal retries rather than retrying myself, so the ID should be the same on retries.
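
For reference, my publishing side currently looks roughly like this (project and topic names are placeholders); I simply rely on the client library's default retry behaviour rather than retrying myself:

    # Sketch of my current publisher: relies on the google-cloud-pubsub
    # client library's built-in retry. Project/topic names are placeholders.
    from google.cloud import pubsub_v1

    publisher = pubsub_v1.PublisherClient()
    topic_path = publisher.topic_path("my-project", "my-topic")

    def publish(payload: bytes) -> str:
        # The library retries transient failures (e.g. DEADLINE_EXCEEDED)
        # internally; I do not retry in my own code.
        future = publisher.publish(topic_path, payload)
        return future.result()  # blocks until the service acknowledges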


2 Answers

3
votes

It is not possible to use Cloud Dataflow to read from one Pub/Sub topic and write to another with a guarantee of no duplicates. Duplicates can happen in one of two ways:

  1. The publisher publishes the same message twice. From the Pub/Sub service's perspective, these are two separate messages and both will be delivered. This can happen if, for example, a publish fails with DEADLINE_EXCEEDED and the publisher retries. In that situation, the first publish attempt may actually have succeeded, with the response simply not reaching the publisher in time.

  2. The Pub/Sub service delivers a message to the subscriber, and either the subscriber does not acknowledge it or the ack is lost in transit back to the service. Pub/Sub has at-least-once delivery guarantees. One main source of duplicates here is that acks are best effort: even if the subscriber sends an ack, it may not make it all the way back to the service, e.g., if there is a network interruption.

Given these two different modes of duplication, the only way to dedupe messages is to do it in the final subscriber receiving them, either via Dataflow or some other mechanism, e.g., recording the received IDs in a database. Note that in either case, it may not be sufficient to use the IDs generated by the Pub/Sub service, since duplicate publishes are possible when requests are retried after an error.
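
To illustrate that last point, one common workaround is to have the publisher attach its own stable identifier as a message attribute, so the subscriber can deduplicate on it regardless of how many service-assigned message IDs the duplicates end up with. The attribute name (unique_id) and the use of UUIDs below are only assumptions for the sketch:

    # Sketch: the publisher attaches its own unique ID as an attribute.
    # Duplicates caused by retried publishes then carry the same ID, even
    # though the service assigns each copy a different message ID.
    import uuid

    from google.cloud import pubsub_v1

    publisher = pubsub_v1.PublisherClient()
    topic_path = publisher.topic_path("my-project", "my-topic")

    def publish_with_id(payload: bytes) -> str:
        unique_id = str(uuid.uuid4())  # generated once per logical message
        publisher.publish(topic_path, payload, unique_id=unique_id).result()
        return unique_id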

3
votes

Cloud Dataflow / Apache Beam are Mack trucks: they are designed to parallelize processing of large data sources and streams. You can push huge amounts of data through Pub/Sub, but detecting duplicates is not a job for Beam, as that task needs to be serialized.

Reading from Pub/Sub and then writing to a different topic does not remove the duplicate problem, because duplicates can just as easily occur on the new topic you are writing to. In addition, parallelizing the queue writes makes the out-of-order problem worse.

The duplicate problem needs to be solved on the client side that reads from the subscription. A simple database query can tell you whether an item has already been processed; if it has, you simply discard the message.
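
A rough sketch of that check, assuming the publisher stamps each message with a unique_id attribute and using SQLite purely as a stand-in for whatever database you actually use (handle is a placeholder for your real processing):

    # Sketch: subscriber-side dedup. If inserting the ID violates the
    # primary key, the message was already processed, so ack and discard.
    import sqlite3

    from google.cloud import pubsub_v1

    db = sqlite3.connect("processed.db", check_same_thread=False)
    db.execute("CREATE TABLE IF NOT EXISTS processed (id TEXT PRIMARY KEY)")

    def callback(message):
        unique_id = message.attributes.get("unique_id", message.message_id)
        try:
            with db:  # commit on success, roll back on error
                db.execute("INSERT INTO processed (id) VALUES (?)", (unique_id,))
        except sqlite3.IntegrityError:
            message.ack()  # duplicate: already processed, just discard
            return
        handle(message.data)  # placeholder for your actual processing
        message.ack()

    subscriber = pubsub_v1.SubscriberClient()
    sub_path = subscriber.subscription_path("my-project", "my-sub")
    streaming_pull = subscriber.subscribe(sub_path, callback=callback)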

Handling out-of-sequence messages must also be designed into your application, as sketched below.
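
One common pattern is to have the publisher stamp each message with a monotonically increasing sequence number (a hypothetical seq attribute) and buffer out-of-order messages on the consumer until the gap is filled:

    # Sketch: per-source reordering buffer. Messages are held in a min-heap
    # and released only when every earlier sequence number has been seen.
    import heapq

    class Reorderer:
        def __init__(self, start_seq: int = 0):
            self.next_seq = start_seq
            self.buffer = []  # min-heap of (seq, payload)

        def push(self, seq: int, payload: bytes):
            """Buffer one message; yield any messages now deliverable in order."""
            heapq.heappush(self.buffer, (seq, payload))
            while self.buffer and self.buffer[0][0] == self.next_seq:
                yield heapq.heappop(self.buffer)[1]
                self.next_seq += 1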

Pub/Sub is designed to be a lightweight, inexpensive message queue system. If you need guaranteed message ordering, no duplicates, FIFO semantics, etc., you will need a different solution, which of course is much more expensive.