2
votes

I have an application writing data to Google Cloud Pub/Sub, and per the Pub/Sub documentation, duplicates can occasionally occur because of its retry mechanism. Message ordering is also not guaranteed, so messages can arrive out of order.

Also per the documentation, it is possible to use Google Cloud Dataflow to deduplicate these messages.

I want to make those messages available in a message queue (meaning Cloud Pub/Sub) for services to consume. Cloud Dataflow does seem to have a PubsubIO writer, but wouldn't writing back to Pub/Sub bring you right back to exactly the same duplication problem? Wouldn't the same apply to ordering? How can I stream messages in order using Pub/Sub (or any other system, for that matter)?

Is it possible to use Cloud Dataflow to read from one Pub/Sub topic and write to another Pub/Sub topic with a guarantee of no duplicates? If not, how else would you do this in a way that supports streaming for a relatively small amount of data?

Also, I am very new to Apache Beam / Cloud Dataflow. What would such a simple use case look like? I suppose I can deduplicate using the ID generated by Pub/Sub itself, since I let the Pub/Sub client library do its internal retries rather than retrying myself, so the ID should be the same on retries.
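
For reference, my publishing side currently looks roughly like this (project and topic names are placeholders); I simply rely on the client library's default retry behaviour rather than retrying myself:

    # Sketch of my current publisher: relies on the google-cloud-pubsub
    # client library's built-in retry. Project/topic names are placeholders.
    from google.cloud import pubsub_v1

    publisher = pubsub_v1.PublisherClient()
    topic_path = publisher.topic_path("my-project", "my-topic")

    def publish(payload: bytes) -> str:
        # The library retries transient failures (e.g. DEADLINE_EXCEEDED)
        # internally; I do not retry in my own code.
        future = publisher.publish(topic_path, payload)
        return future.result()  # blocks until the service acknowledges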


2 Answers

3
votes

It is not possible to use Cloud Dataflow to read from one Pub/Sub topic and write to another with a guarantee of no duplicates. Duplicates can happen in one of two ways:

  1. The publisher publishes the same message twice. From the Pub/Sub service's perspective, these are two separate messages and both will be delivered. This can happen if, for example, a publish fails with DEADLINE_EXCEEDED and the publisher retries. In that situation, the first publish attempt may actually have succeeded, with the response simply not reaching the publisher in time.

  2. The Pub/Sub service delivers a message to the subscriber, and either the subscriber does not acknowledge it or the ack is lost in transit back to the service. Pub/Sub has at-least-once delivery guarantees. One main source of duplicates here is that acks are best effort: even if the subscriber sends an ack, it may not make it all the way back to the service, e.g., if there is a network interruption.

Given these two different modes of duplication, the only way to dedupe messages is to do it in the final subscriber receiving them, either via Dataflow or some other mechanism, e.g., recording the received IDs in a database. Note that in either case, it may not be sufficient to use the IDs generated by the Pub/Sub service, since duplicate publishes are possible when requests are retried after an error.
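
To illustrate that last point, one common workaround is to have the publisher attach its own stable identifier as a message attribute, so the subscriber can deduplicate on it regardless of how many service-assigned message IDs the duplicates end up with. The attribute name (unique_id) and the use of UUIDs below are only assumptions for the sketch:

    # Sketch: the publisher attaches its own unique ID as an attribute.
    # Duplicates caused by retried publishes then carry the same ID, even
    # though the service assigns each copy a different message ID.
    import uuid

    from google.cloud import pubsub_v1

    publisher = pubsub_v1.PublisherClient()
    topic_path = publisher.topic_path("my-project", "my-topic")

    def publish_with_id(payload: bytes) -> str:
        unique_id = str(uuid.uuid4())  # generated once per logical message
        publisher.publish(topic_path, payload, unique_id=unique_id).result()
        return unique_id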

3
votes

Cloud Dataflow / Apache Beam are Mack trucks: they are designed to parallelize processing of large data sources and streams. You can push huge amounts of data through Pub/Sub, but detecting duplicates is not a job for Beam, as that task needs to be serialized.

Reading from Pub/Sub and then writing to a different topic does not remove the duplicate problem, because duplicates can just as easily occur on the new topic you are writing to. In addition, parallelizing the queue writes makes the out-of-order problem worse.

The duplicate problem needs to be solved on the client side that reads from the subscription. A simple database query can tell you whether an item has already been processed; if it has, you simply discard the message.
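
A rough sketch of that check, assuming the publisher stamps each message with a unique_id attribute and using SQLite purely as a stand-in for whatever database you actually use (handle is a placeholder for your real processing):

    # Sketch: subscriber-side dedup. If inserting the ID violates the
    # primary key, the message was already processed, so ack and discard.
    import sqlite3

    from google.cloud import pubsub_v1

    db = sqlite3.connect("processed.db", check_same_thread=False)
    db.execute("CREATE TABLE IF NOT EXISTS processed (id TEXT PRIMARY KEY)")

    def callback(message):
        unique_id = message.attributes.get("unique_id", message.message_id)
        try:
            with db:  # commit on success, roll back on error
                db.execute("INSERT INTO processed (id) VALUES (?)", (unique_id,))
        except sqlite3.IntegrityError:
            message.ack()  # duplicate: already processed, just discard
            return
        handle(message.data)  # placeholder for your actual processing
        message.ack()

    subscriber = pubsub_v1.SubscriberClient()
    sub_path = subscriber.subscription_path("my-project", "my-sub")
    streaming_pull = subscriber.subscribe(sub_path, callback=callback)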

Handling out-of-sequence messages must also be designed into your application, as sketched below.
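
One common pattern is to have the publisher stamp each message with a monotonically increasing sequence number (a hypothetical seq attribute) and buffer out-of-order messages on the consumer until the gap is filled:

    # Sketch: per-source reordering buffer. Messages are held in a min-heap
    # and released only when every earlier sequence number has been seen.
    import heapq

    class Reorderer:
        def __init__(self, start_seq: int = 0):
            self.next_seq = start_seq
            self.buffer = []  # min-heap of (seq, payload)

        def push(self, seq: int, payload: bytes):
            """Buffer one message; yield any messages now deliverable in order."""
            heapq.heappush(self.buffer, (seq, payload))
            while self.buffer and self.buffer[0][0] == self.next_seq:
                yield heapq.heappop(self.buffer)[1]
                self.next_seq += 1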

Pub/Sub is designed to be a lightweight, inexpensive message queue system. If you need guaranteed message ordering, no duplicates, FIFO semantics, etc., you will need a different solution, which of course is much more expensive.