2 votes

I have been looking through the source and documentation for Google Dataflow, and I didn't see any mention of the message delivery semantics around PubSubIO.Read.

The problem I am trying to understand is: what kind of message delivery semantics do PubSubIO and Google Dataflow provide? Based on my reading of the source, messages get acked before they are emitted via the ProcessContext#output method. This would imply that a Dataflow streaming job can lose messages that have been acked but not yet passed on.

So, how does Dataflow guarantee (if at all) correctness around windows (especially session windows), etc. in case of failure and redeployment of jobs?
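For concreteness, the kind of pipeline I have in mind looks roughly like the sketch below (written against the Apache Beam Java SDK, the successor to the Dataflow SDK; the project, subscription, and topic names are placeholders): read from Pub/Sub, window into sessions, aggregate, and write back out.

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.options.StreamingOptions;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.windowing.Sessions;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.TypeDescriptors;
import org.joda.time.Duration;

public class SessionCounts {
  public static void main(String[] args) {
    StreamingOptions options =
        PipelineOptionsFactory.fromArgs(args).as(StreamingOptions.class);
    options.setStreaming(true);

    Pipeline p = Pipeline.create(options);
    p.apply("ReadFromPubsub",
            PubsubIO.readStrings().fromSubscription(
                "projects/my-project/subscriptions/my-subscription"))
        // 10-minute session windows; the GroupByKey inside Count.perElement()
        // is where elements get grouped per key and window.
        .apply(Window.<String>into(
            Sessions.withGapDuration(Duration.standardMinutes(10))))
        .apply(Count.perElement())
        .apply(MapElements.into(TypeDescriptors.strings())
            .via((KV<String, Long> kv) -> kv.getKey() + ": " + kv.getValue()))
        .apply("WriteToPubsub",
            PubsubIO.writeStrings().to("projects/my-project/topics/output-topic"));

    p.run();
  }
}
```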


1 Answer

3 votes

Dataflow doesn't ack messages to Pub/Sub until they have been persisted in intermediate storage within the pipeline (or emitted to the sink, if there is no GroupByKey within the pipeline). We also deduplicate messages read from Pub/Sub for a short period to prevent duplicate delivery caused by missed acks. So Dataflow guarantees exactly-once delivery, modulo any duplicates inserted by publishers at drastically different times.
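If publisher-side retries are a concern, one option is to have the publisher attach its own unique ID to each message and read with an ID attribute, so the service can also deduplicate on that ID. The sketch below uses the Beam Java SDK's withIdAttribute (formerly idLabel); the attribute name and subscription path are placeholders, and this dedup window is itself bounded, so it covers retries that land close together in time rather than arbitrarily far apart.

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.values.PCollection;

public class DedupedRead {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    PCollection<String> messages = p.apply(
        PubsubIO.readStrings()
            .fromSubscription("projects/my-project/subscriptions/my-subscription")
            // Publisher-supplied dedup key: messages carrying the same value
            // in this attribute are treated as duplicates on read.
            .withIdAttribute("uniqueMessageId"));

    // ... downstream transforms on `messages` ...
    p.run();
  }
}
```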

Any intermediate state buffered within a running pipeline is maintained when the pipeline is updated. Streaming pipelines do not fail -- instead, they continue to retry elements that hit errors. Either the error is transient and the element will eventually be processed successfully, or, in the case of a consistent exception (a NullPointerException in your code, etc.), you can update the job with corrected code that will then be used to process the failing element.
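As an illustrative sketch (the DoFn, field layout, and failure mode here are hypothetical, not taken from the question): if a DoFn throws on malformed records, those elements are retried until a job update ships a guarded version of the same code, which then processes the stuck elements.

```java
import org.apache.beam.sdk.transforms.DoFn;

// Hypothetical corrected DoFn: the original version indexed fields[1]
// unconditionally and threw on malformed records, so those elements were
// retried indefinitely. The guard added here lets the retried elements
// succeed (or be silently dropped) once the job is updated.
public class ExtractUserId extends DoFn<String, String> {
  @ProcessElement
  public void processElement(ProcessContext c) {
    String[] fields = c.element().split(",");
    if (fields.length > 1 && !fields[1].isEmpty()) {
      c.output(fields[1]);
    }
    // Malformed records fall through; routing them to a dead-letter
    // output would be another option.
  }
}
```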

(Note that the implementation is different for the DirectRunner, which may be causing confusion if you are looking at that part of the code.)