Consider the following setup:
- Pub/Sub
- Dataflow: a streaming job that validates events from Pub/Sub, unpacks them, and writes them to BigQuery
- BigQuery
We have counters on the valid events that pass through our Dataflow pipeline, and we observe that these counters are higher than the number of events that were available in Pub/Sub.
Note: It seems we also see duplicates in BigQuery, but we are still investigating this.
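For context, here is a minimal sketch of the pipeline shape and where the counters sit, assuming the Beam Python SDK and JSON payloads (the validation rule and table name are illustrative, not our actual code):

```python
import json

import apache_beam as beam
from apache_beam.metrics import Metrics
from apache_beam.options.pipeline_options import PipelineOptions


class ValidateAndUnpack(beam.DoFn):
    """Validates raw Pub/Sub payloads and unpacks them into BigQuery rows."""

    def __init__(self):
        self.valid_events = Metrics.counter(self.__class__, "valid_events")
        self.invalid_events = Metrics.counter(self.__class__, "invalid_events")

    def process(self, payload):
        try:
            event = json.loads(payload.decode("utf-8"))
        except (UnicodeDecodeError, ValueError):
            self.invalid_events.inc()
            return
        # Illustrative validation rule; the real checks are more involved.
        if "event_id" in event:
            self.valid_events.inc()
            yield event
        else:
            self.invalid_events.inc()


def run():
    options = PipelineOptions(streaming=True)
    with beam.Pipeline(options=options) as pipeline:
        (
            pipeline
            | "Read" >> beam.io.ReadFromPubSub(
                subscription="projects/<redacted>/subscriptions/<redacted>")
            | "ValidateAndUnpack" >> beam.ParDo(ValidateAndUnpack())
            | "Write" >> beam.io.WriteToBigQuery(
                "<redacted>:events.events",  # illustrative table name
                create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND)
        )


if __name__ == "__main__":
    run()
```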
The following error can be observed in the Dataflow logs:
Pipeline stage consuming pubsub took 1h35m7.83313078s and default ack deadline is 5m.
Consider increasing ack deadline for subscription projects/<redacted>/subscriptions/<redacted>
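If it is relevant, this is roughly how we could raise the ack deadline as the log suggests, using the google-cloud-pubsub Python client (we have not applied this yet; 600s is the maximum Pub/Sub allows):

```python
from google.cloud import pubsub_v1
from google.protobuf import field_mask_pb2

subscriber = pubsub_v1.SubscriberClient()

# Raise the ack deadline to the 600s maximum allowed by Pub/Sub.
subscription = pubsub_v1.types.Subscription(
    name="projects/<redacted>/subscriptions/<redacted>",
    ack_deadline_seconds=600,
)
update_mask = field_mask_pb2.FieldMask(paths=["ack_deadline_seconds"])

with subscriber:
    result = subscriber.update_subscription(
        request={"subscription": subscription, "update_mask": update_mask}
    )
    print(f"Ack deadline is now {result.ack_deadline_seconds}s")
```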
Note that the Dataflow job is started when there are already millions of messages waiting in Pub/Sub.
Questions:
- Can this cause duplicate events to be picked up by the pipeline?
- Is there anything we can do to alleviate this issue?