0
votes

My company receives both batch and stream based event data. I want to process the data using Google Cloud dataflow over a predictable time period. However, I realize that in some instances the data comes late or out of order. How to use Dataflow to handle late or out of order?

This is a homework question, and would like to know the only answer in below.

a. Set a single global window to capture all data

b. Set sliding window to capture all the lagged data

c. Use watermark and timestamps to capture the lagged data

d. Ensure every datasource type (stream or batch) has a timestamp, and use the timestamps to define the logic for lagged data.

My reasoning - I believe 'C' is the answer. But then, watermark is actually different from late data. Please confirm. Also, since the question mentioned both batch and stream based, i also think if 'D' could be the answer since 'batch'(or bounded collection) mode doesn't have the timestamps unless it comes from source or is programmatically set. So, i am a bit confused on the answer.

Please help here. I am a non-native english speaker, so not sure if I could have missed some cues in the question.

2

2 Answers

2
votes
How to use Dataflow to handle late or out of order

This is a big question. I will try to give some simple explanations but provide some resources that might help you understand.

Bounded data collection

You have gotten a sense of it: bounded data does not have lateness problem. By the nature of bounded data, you can read the full data set at once before pipeline starts.

Unbounded data collection

Your C is correct, and watermark is different from late data. Watermark in implementation is a monotonically increasing timestamp. When Beam/Dataflow see a record with a event timestamp that is earlier than the watermark, the record is treated as late data (this is only conceptual and you might want to check[1] for some detailed discussion).

Here are [2], [3], [4] as reference for this topic:

  1. https://docs.google.com/document/d/12r7frmxNickxB5tbpuEh_n35_IJeVZn1peOrBrhhP6Y/edit#heading=h.7a03n7d5mf6g

  2. https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102

  3. https://www.oreilly.com/library/view/streaming-systems/9781491983867/

  4. https://docs.google.com/presentation/d/1ln5KndBTiskEOGa1QmYSCq16YWO9Dtmj7ZwzjU7SsW4/edit#slide=id.g19b6635698_3_4

0
votes

B and C may be the answer.

With sliding windows, you have the order of the data, so if you recive the data in position 9 and you don't recive the data in the position 8, you know that data 8 is delayed and wait for it. The problem is, if the latest data is delayed, you can't know this data is delayed and you lost it. https://en.wikipedia.org/wiki/Sliding_window_protocol

Watermark, wait a period of time for the lagged data, if this time passes and the data doesn't arrive, you lose this data.

So, the answer is C, because B says "capture all the lagged data" and C ignores the word all