4 votes

I would like to use Google Cloud Dataflow to create session windows as explained in the Dataflow Model paper. I would like to send my unbounded data to Pub/Sub, then read it in Cloud Dataflow in streaming mode. I want to use session windows with a large timeout (30 to 120 minutes).
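For concreteness, here is a minimal sketch of the kind of pipeline I have in mind (written against the Apache Beam Java SDK; the project/topic names and the "key,value" message parsing are just placeholders):

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
    import org.apache.beam.sdk.options.PipelineOptions;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.transforms.MapElements;
    import org.apache.beam.sdk.transforms.Sum;
    import org.apache.beam.sdk.transforms.windowing.AfterProcessingTime;
    import org.apache.beam.sdk.transforms.windowing.Repeatedly;
    import org.apache.beam.sdk.transforms.windowing.Sessions;
    import org.apache.beam.sdk.transforms.windowing.Window;
    import org.apache.beam.sdk.values.KV;
    import org.apache.beam.sdk.values.TypeDescriptors;
    import org.joda.time.Duration;

    public class SessionSums {
      public static void main(String[] args) {
        PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
        Pipeline p = Pipeline.create(options);

        p.apply("ReadFromPubSub",
                PubsubIO.readStrings().fromTopic("projects/my-project/topics/my-topic"))
         // Messages are assumed to be "key,value" strings for this sketch.
         .apply("ParseKV", MapElements
             .into(TypeDescriptors.kvs(TypeDescriptors.integers(), TypeDescriptors.integers()))
             .via(line -> KV.of(
                 Integer.parseInt(line.split(",")[0]),
                 Integer.parseInt(line.split(",")[1]))))
         // Session windows with a 30-minute gap, firing roughly once a minute
         // in processing time and accumulating across firings.
         .apply("SessionWindows", Window.<KV<Integer, Integer>>into(
                 Sessions.withGapDuration(Duration.standardMinutes(30)))
             .triggering(Repeatedly.forever(
                 AfterProcessingTime.pastFirstElementInPane()
                     .plusDelayOf(Duration.standardMinutes(1))))
             .withAllowedLateness(Duration.ZERO)
             .accumulatingFiredPanes())
         .apply("SumPerKey", Sum.integersPerKey());

        p.run();
      }
    }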

My questions are:

1) What happens if the Dataflow process fails?

2) Do I lose all data stored in windows that still have not timed out?

3) What recovery mechanisms does Dataflow provide?

Example:

Let's say I have a session window with a 30-minute timeout that triggers every minute of processing time, with accumulation. Let's say the value is an integer and I am just summing all values in a window. Let's say that these key-value pairs are coming from Pub/Sub:

7 -> 10 (at time 0 seconds)
7 -> 20 (at time 30 seconds)
7 -> 50 (at time 65 seconds)
7 -> 60 (at time 75 seconds)

I suppose that at time 60 seconds the window would trigger and produce a 7 -> 30 pair. I also suppose that at time 120 seconds the window would trigger again and produce a 7 -> 140 pair, since it accumulates across firings.

My question is: what happens if Dataflow fails at time 70 seconds? I suppose that the 3 messages received before the 70th second would already have been acked to Pub/Sub, so they won't be redelivered.

When Dataflow restarts, would it somehow restore the state of the window with key 7 so that at time 120 seconds it can produce a 7 -> 140 pair, or will it just produce a 7 -> 60 pair?

Also, a related question: if I cancel the Dataflow job and start a new one, I suppose the new one would not have the state of the previous job. Is there a way to transfer the state to the new job?


1 Answer

4 votes

Cloud Dataflow handles failures transparently. For example, it will only "ack" messages in Cloud Pub/Sub after they have been processed and the results durably committed. If the Dataflow process fails (I'm assuming you're referring to, say, a crash of a worker JVM, which would then be automatically restarted, rather than complete failure of the whole job), on restart it will connect to Pub/Sub again, and all non-acked messages will be redelivered and reprocessed, including grouping into windows, etc. Window state is also durably preserved across failures, so in this case it should produce 7 -> 140.

If you are interested in the implementation of this persistence, please see the MillWheel paper; it predates Dataflow, but Dataflow's streaming runner uses the same persistence layer.

There are no user-facing recovery mechanisms in Dataflow, because the programming model isolates you from the need to handle failures; the runner takes care of all necessary recovery. The only way failures are visible is that records can be processed multiple times, i.e. if you perform any side effects in your DoFns, those side effects must be idempotent.
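For example, one way to make a side-effecting DoFn idempotent is to derive a deterministic key for its writes, so that a reprocessed record simply overwrites the same row. A rough sketch (the external store client here is hypothetical):

    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.transforms.windowing.BoundedWindow;
    import org.apache.beam.sdk.values.KV;

    // Hypothetical client for some external key-value store with
    // overwrite-by-key ("upsert") semantics.
    class SessionStoreClient {
      void upsert(String rowKey, int value) { /* write, overwriting any previous value */ }
    }

    class WriteSessionSumFn extends DoFn<KV<Integer, Integer>, Void> {
      private transient SessionStoreClient store;

      @Setup
      public void setup() {
        store = new SessionStoreClient();
      }

      @ProcessElement
      public void processElement(ProcessContext c, BoundedWindow window) {
        KV<Integer, Integer> sum = c.element();
        // Deterministic row key per (key, window): if this element is
        // reprocessed after a failure, the upsert rewrites the same row
        // instead of creating a duplicate, so the side effect is idempotent.
        String rowKey = sum.getKey() + ":" + window.maxTimestamp().getMillis();
        store.upsert(rowKey, sum.getValue());
      }
    }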

Currently, the only case where state is transferred between jobs is during the pipeline update operation.
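Concretely, an update is requested by launching the replacement pipeline with the same job name as the running job and the update option set. A sketch using the Dataflow runner's options class (the job name is a placeholder):

    import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;

    public class UpdateExample {
      public static void main(String[] args) {
        DataflowPipelineOptions options =
            PipelineOptionsFactory.fromArgs(args).as(DataflowPipelineOptions.class);
        options.setJobName("my-session-pipeline"); // must match the running job being replaced
        options.setUpdate(true); // ask Dataflow to update that job in place, carrying over its state
        // ... build and run the (compatible) replacement pipeline with these options ...
      }
    }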