0 votes

Consider the following setup:

  • Pub/Sub
  • Dataflow: streaming job for validating events from Pub/Sub, unpacking and writing to BigQuery
  • BigQuery

We have counters on the valid events that pass through our Dataflow pipeline, and we observe that these counters are higher than the number of events that were available in Pub/Sub.
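
For reference, the pipeline is roughly of the following shape. This is only a minimal Beam Python sketch; the project, subscription and table names as well as the validation rule are placeholders, not our real code:

import json

import apache_beam as beam
from apache_beam.metrics import Metrics
from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions


class ValidateAndUnpack(beam.DoFn):
    """Validates and unpacks a raw Pub/Sub payload, counting valid events."""

    def __init__(self):
        super().__init__()
        self.valid_events = Metrics.counter(self.__class__, "valid_events")

    def process(self, payload):
        record = json.loads(payload.decode("utf-8"))
        if "event_id" in record:  # placeholder validation rule
            self.valid_events.inc()
            yield record


def run():
    options = PipelineOptions()
    options.view_as(StandardOptions).streaming = True

    with beam.Pipeline(options=options) as pipeline:
        (
            pipeline
            | "ReadFromPubSub" >> beam.io.ReadFromPubSub(
                subscription="projects/my-project/subscriptions/my-subscription")
            | "ValidateAndUnpack" >> beam.ParDo(ValidateAndUnpack())
            | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
                "my-project:my_dataset.events",
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
                create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER)
        )


if __name__ == "__main__":
    run()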

Note: It seems we also see duplicates in BigQuery, but we are still investigating this.

The following error can be observed in the Dataflow logs:

Pipeline stage consuming pubsub took 1h35m7.83313078s and default ack deadline is 5m. 
Consider increasing ack deadline for subscription projects/<redacted>/subscriptions/<redacted>
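
For reference, raising the deadline the log suggests would look roughly like this; a sketch using the google-cloud-pubsub Python client, where the project and subscription names are placeholders for the redacted ones, and Pub/Sub caps the ack deadline at 600 seconds (10 minutes):

from google.cloud import pubsub_v1
from google.protobuf import field_mask_pb2

# Placeholder names for the redacted project and subscription.
subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path("my-project", "my-subscription")

subscription = pubsub_v1.types.Subscription(
    name=subscription_path,
    ack_deadline_seconds=600,  # Pub/Sub allows at most 600 seconds.
)
update_mask = field_mask_pb2.FieldMask(paths=["ack_deadline_seconds"])

updated = subscriber.update_subscription(
    request={"subscription": subscription, "update_mask": update_mask}
)
print(f"Ack deadline is now {updated.ack_deadline_seconds}s")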

Note that the Dataflow job is started when there are already millions of messages waiting in Pub/Sub.

Questions:

  • Can this cause duplicate events to be picked up by the pipeline?
  • Is there anything we can do to alleviate this issue?

1 Answer

1 vote

My recommendation is to drain the backlog of the Pub/Sub subscription by first launching the Dataflow job in batch mode, and then run it in streaming mode for normal operation. This way, your streaming job starts from a clean basis instead of facing a long queue of enqueued messages.

In addition, being able to run the same pipeline in both streaming and batch mode is one of the strengths of Dataflow (and Beam).
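
For illustration, in Beam's Python SDK the execution mode is typically selected through the streaming pipeline option, so the same pipeline code can be submitted either way. A minimal sketch, assuming the rest of the pipeline construction is shared between both runs:

from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions


def build_options(streaming: bool) -> PipelineOptions:
    """Build identical pipeline options for either a batch or a streaming run."""
    options = PipelineOptions()
    options.view_as(StandardOptions).streaming = streaming
    return options


# First run: work through the backlog with streaming disabled (batch mode).
batch_options = build_options(streaming=False)

# Usual operation: the same pipeline code, submitted as a streaming job.
streaming_options = build_options(streaming=True)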