I have been reading about how Dataflow acks messages when reading data in streaming mode. Based on the answers here and here, it seems that Dataflow acks messages per bundle: once it finishes processing a bundle, it acks all the messages in it.
What confuses me is what happens when a GroupByKey is involved in the pipeline. In that case, the data in the bundle is persisted to a multi-regional bucket and the messages are then acknowledged. Now imagine the whole region goes down. The intermediate data will still be in the bucket (because it is multi-regional).
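To make the scenario concrete, here is a minimal sketch of the kind of pipeline I am asking about, using the Beam Python SDK (the project, topic names, and keying logic are just placeholders):

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms import window

# Placeholder names -- substitute your own project and topics.
INPUT_TOPIC = "projects/my-project/topics/input"
OUTPUT_TOPIC = "projects/my-project/topics/output"

options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as p:
    (
        p
        # As I understand it, messages are acked once the bundle that
        # read them is committed, i.e. after the stage before the
        # GroupByKey persists its output.
        | "ReadFromPubSub" >> beam.io.ReadFromPubSub(topic=INPUT_TOPIC)
        | "Window" >> beam.WindowInto(window.FixedWindows(60))
        | "KeyByFirstByte" >> beam.Map(lambda msg: (msg[:1], msg))
        # The shuffle boundary: upstream results are checkpointed here,
        # so a replay after this point no longer has the source messages.
        | "GroupByKey" >> beam.GroupByKey()
        | "Format" >> beam.Map(lambda kv: b"%s: %d messages" % (kv[0], len(kv[1])))
        | "WriteToPubSub" >> beam.io.WriteToPubSub(topic=OUTPUT_TOPIC)
    )
```

Once the messages feeding the GroupByKey have been acked, the only copy of that data is the checkpointed intermediate state, which is what my questions below are about.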
With that in mind:
- What steps should I follow in order to not lose any data?
- Are there any recommendations for handling this active/active approach so that no data is lost when a region is completely down?
Please advise.