How does Dataflow handle pubsub acknowledgement if Dataflow were to fail?

Question

I'm trying to understand the process in which Dataflow will acknowledge pubsub messages, and what kind of guarantees I have for processing all data regardless of if there is a failure.

I understand that Dataflow will ack a message when it saves it to some sort of persistent storage, but I'm not quite sure exactly when that would be.

Take for example a simple pipeline, reads messages from Pubsub, does a small transform on the message type to convert to something easily writable (a pardo), and saves to a text file in GCS. From a StackDriver dashboard, it seems like Dataflow is Acking messages as soon as they enter the pipeline and only gets backed up when writing the last window of files.

With this, I know when an error occurs with a message a streaming Dataflow job will continue running until the message works, or the pipeline is updated as mentioned here. However, due to the need for reliability in storing messages, what happens in the scenario where Dataflow itself or Beam runs into an internal error causing the pipeline to crash. If the messages are written to some sort of persistent storage (not my end GCS bucket), will a new pipeline be able to pick these up?

TLDR: What happens in the case of Dataflow itself failing completely. Will these messages that seem to get acked as they come in be lost or will they be picked up by a replacement?

Note: I read the answer given here but this seems to be talking about the failure case a step before a complete failure.

chamikara chamikara · Accepted Answer · 2019-04-04T18:59:06

Streaming Dataflow will retry failed workitems, so if a worker fails for some reason, Dataflow will retry the same work and should pick up from the point of failure without data loss.

As described in the previous answer you mentioned, currently there is no way to transfer state between two pipelines (unless it's an update), so if a pipeline fails completely (which should be very rare) and you start a new pipeline, the second pipeline will pick up from the last un-acked message from the PubSub topic.

How does Dataflow handle pubsub acknowledgement if Dataflow were to fail?

1 Answers