I would like to use Google Cloud Dataflow to create session windows as explained in the Dataflow Model paper. I would like to send my unbounded data to Pub/Sub, then read it in Cloud Dataflow as a streaming pipeline. I want to use session windows with a large gap duration (30 to 120 minutes).
My questions are:
1) What happens if the Dataflow process fails?
2) Do I lose all data stored in windows that still have not timed out?
3) What recovery mechanisms does Dataflow provide?
Example:
Let's say I have a Sessions window with a 30-minute gap duration (timeout) that triggers every minute of processing time with accumulating panes. Let's say the values are integers and I am simply summing all values in a window. Let's say these key-value pairs are coming from Pub/Sub (a rough sketch of the pipeline follows the example data):
7 -> 10 (at time 0 seconds)
7 -> 20 (at time 30 seconds)
7 -> 50 (at time 65 seconds)
7 -> 60 (at time 75 seconds)
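For reference, here is a minimal sketch of the pipeline I have in mind, assuming the Apache Beam Java SDK; the topic name, the "key,value" message format, and the ParseKv helper are placeholders for my actual setup:

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.options.StreamingOptions;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.Sum;
import org.apache.beam.sdk.transforms.windowing.AfterProcessingTime;
import org.apache.beam.sdk.transforms.windowing.Repeatedly;
import org.apache.beam.sdk.transforms.windowing.Sessions;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.joda.time.Duration;

public class SessionSums {
  // Parses messages of the form "key,value", e.g. "7,10" (the format is a placeholder).
  static class ParseKv extends DoFn<String, KV<Integer, Integer>> {
    @ProcessElement
    public void processElement(ProcessContext c) {
      String[] parts = c.element().split(",");
      c.output(KV.of(Integer.parseInt(parts[0].trim()), Integer.parseInt(parts[1].trim())));
    }
  }

  public static void main(String[] args) {
    StreamingOptions options = PipelineOptionsFactory.fromArgs(args).as(StreamingOptions.class);
    options.setStreaming(true);
    Pipeline p = Pipeline.create(options);

    p.apply("ReadFromPubSub",
            PubsubIO.readStrings().fromTopic("projects/my-project/topics/my-topic"))
     .apply("ParseKeyValue", ParDo.of(new ParseKv()))
     .apply("SessionWindow",
            Window.<KV<Integer, Integer>>into(
                    Sessions.withGapDuration(Duration.standardMinutes(30)))
                // Fire repeatedly, roughly once per minute of processing time while the
                // session is open, keeping earlier panes (accumulating mode).
                .triggering(Repeatedly.forever(
                    AfterProcessingTime.pastFirstElementInPane()
                        .plusDelayOf(Duration.standardMinutes(1))))
                .accumulatingFiredPanes()
                .withAllowedLateness(Duration.ZERO))
     .apply("SumPerKey", Sum.integersPerKey())
     .apply("LogSums", ParDo.of(new DoFn<KV<Integer, Integer>, Void>() {
       @ProcessElement
       public void processElement(ProcessContext c) {
         System.out.println(c.element().getKey() + " -> " + c.element().getValue());
       }
     }));

    p.run();
  }
}
```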
I suppose that at time 60 seconds the window would trigger and produce a 7 -> 30 pair. I also suppose that at time 120 seconds the window would trigger again and produce a 7 -> 140 pair, since it fires with accumulation.
My question is: what happens if Dataflow fails at time 70 seconds? I suppose that the three messages received before the 70th second would already have been acked to Pub/Sub, so they won't be redelivered.
When Dataflow restarts, would it somehow restore the state of the window for key 7 so that at time 120 seconds it can produce a 7 -> 140 pair, or will it just produce a 7 -> 60 pair?
A related question: if I cancel the Dataflow job and start a new one, I suppose the new job would not have the state of the previous one. Is there a way to transfer that state to the new job?