My company receives both batch and stream-based event data. I want to process the data using Google Cloud Dataflow over a predictable time period. However, I realize that in some instances the data arrives late or out of order. How can I use Dataflow to handle late or out-of-order data?
This is a homework question, and I would like to know which one of the options below is the correct answer.
a. Set a single global window to capture all data
b. Set a sliding window to capture all the lagged data
c. Use watermark and timestamps to capture the lagged data
d. Ensure every data source type (stream or batch) has a timestamp, and use the timestamps to define the logic for lagged data.
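My reasoning: I believe 'C' is the answer. But then, a watermark is not the same thing as late data: as I understand it, the watermark is the system's estimate of how far event time has progressed, and late data is whatever arrives behind that estimate. Please confirm. Also, since the question mentions both batch and stream-based data, I wonder whether 'D' could be the answer, because in batch mode (a bounded collection) the elements do not carry meaningful event timestamps unless they come from the source or are set programmatically. So I am a bit confused about the answer.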
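To make my understanding concrete, here is a small plain-Python sketch of how I think timestamps, the watermark, and late data relate. It is only a toy model and does not use the Dataflow or Apache Beam API. The names `Event`, `classify`, `watermark_lag`, and `allowed_lateness` are my own, the watermark is a made-up heuristic (largest event time seen so far minus a fixed lag), and windows are ignored entirely.

```python
from dataclasses import dataclass


@dataclass
class Event:
    value: str
    event_time: int  # seconds; when the event happened at the source


def classify(events, watermark_lag=5, allowed_lateness=10):
    """Label events, in arrival order, as on_time, late, or dropped.

    Toy assumption: watermark = (max event_time seen so far) - watermark_lag.
    Real runners estimate the watermark differently and compare it
    against window boundaries, which this sketch ignores.
    """
    max_seen = None
    labels = []
    for e in events:
        # Watermark is computed from the events that arrived before this one.
        watermark = None if max_seen is None else max_seen - watermark_lag
        if watermark is None or e.event_time >= watermark:
            label = "on_time"   # possibly out of order, but not behind the watermark
        elif e.event_time >= watermark - allowed_lateness:
            label = "late"      # behind the watermark, still within allowed lateness
        else:
            label = "dropped"   # too far behind the watermark to be accepted
        labels.append((e.value, label))
        max_seen = e.event_time if max_seen is None else max(max_seen, e.event_time)
    return labels


# Events listed in arrival order; event_time is when they actually happened.
arrivals = [Event("a", 10), Event("b", 20), Event("c", 17),
            Event("d", 12), Event("e", 2)]
for value, label in classify(arrivals):
    print(value, label)
```

In this sketch, "c" arrives out of order but is still on time, while "d" is late and "e" is dropped. None of it works without a timestamp on every event, which is why I cannot decide between 'C' and 'D'.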
Please help. I am a non-native English speaker, so I am not sure whether I have missed some cues in the question.