0
votes

I came across the post Scaling Klaviyo's Event Processing Pipeline with Stream Processing. In it, engineers at a company called Klaviyo do counting over different timeframes: hourly, daily, even monthly.

I have a couple of questions. If I understand correctly, they're using time windows, but is it normal to use a time window for such a long duration, like a whole day?

That doesn't make sense to me: if you're doing daily or monthly counting, why not use batch processing? What is the fundamental benefit of using streaming in such a case?


A different case: if I need to count Kafka events from the very beginning, in real time, what is the real-world solution? Use Flink streaming to update a "counter" in Redis every time an event arrives? If the Kafka topic is quite busy, say several million messages per second, wouldn't there be too much I/O and network traffic?


1 Answer

1
vote

That doesn't make sense to me: if you're doing daily or monthly counting, why not use batch processing? What is the fundamental benefit of using streaming in such a case?

Sure, you could use batch processing instead. But how will you handle reprocessing? You would have to rerun both the batch job and the real-time job, and the results may not match, because you now have two separate code paths producing the same numbers.

For aggregations over days, there is the RocksDB state backend, so the state will not blow up the memory. (In Kafka Streams, the state is even backed up to Kafka itself via changelog topics.)
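To see why a day-long window is not scary, note that the per-window state is just one counter per key, not the events themselves; it stays tiny no matter how many events arrive. A minimal sketch of that tumbling-window logic in plain Python (not Flink or Kafka Streams; the event tuples and keys are made up for illustration):

```python
from collections import defaultdict

DAY_MS = 24 * 60 * 60 * 1000  # one day in milliseconds

def tumbling_daily_count(events):
    """Count events per (key, day-window). Each event is (timestamp_ms, key).
    State is one integer per open window -- it never grows with event volume."""
    counts = defaultdict(int)
    for ts, key in events:
        window_start = (ts // DAY_MS) * DAY_MS  # align timestamp to its day boundary
        counts[(key, window_start)] += 1
    return dict(counts)

events = [
    (10, "click"),          # falls in day 0
    (999, "click"),         # falls in day 0
    (DAY_MS + 5, "click"),  # falls in day 1
]
print(tumbling_daily_count(events))
# {('click', 0): 2, ('click', 86400000): 1}
```

A real engine adds event-time watermarks and emits (or updates) the window result as it goes, but the state it keeps per window is the same small aggregate.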

If you need to update the counter every time an event arrives, the question becomes: does anyone actually need to see the counter within 0.001 ms? You can micro-batch the streaming pipeline at around 0.3 seconds, roughly the shortest delay people can perceive. That's also why people say near-real-time: it's not strictly real-time, but it already fulfills the need.
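The micro-batching idea above can be sketched as: accumulate increments locally and flush the summed deltas to the external counter once per interval, so several million events per second become a handful of writes per second. A hypothetical sketch (the dict `store` stands in for Redis, and `MicroBatchCounter` is a made-up name, not a real library API):

```python
class MicroBatchCounter:
    """Buffer per-key increments and flush summed deltas periodically,
    turning one external write per event into one write per key per interval."""

    def __init__(self, store, flush_interval_s=0.3):
        self.store = store                    # stand-in for Redis: key -> count
        self.flush_interval_s = flush_interval_s
        self.buffer = {}                      # key -> pending delta
        self.writes = 0                       # external writes actually issued

    def record(self, key, n=1):
        # Called once per event; touches only local memory, no I/O.
        self.buffer[key] = self.buffer.get(key, 0) + n

    def flush(self):
        # In a real pipeline this runs on a ~0.3 s timer or window trigger.
        for key, delta in self.buffer.items():
            self.store[key] = self.store.get(key, 0) + delta
            self.writes += 1
        self.buffer.clear()

store = {}
counter = MicroBatchCounter(store)
for _ in range(1_000_000):    # a million events arrive within one interval...
    counter.record("clicks")
counter.flush()
print(store["clicks"], counter.writes)  # 1000000 1 -- one write, not a million
```

With Redis specifically, the flush could be a single pipelined batch of INCRBY commands, one per key, which keeps network round-trips independent of the event rate.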