
Let's say we want to compute the sum and average of the items, and we can work either with state or with (time) windows.

Example working with windows - https://ci.apache.org/projects/flink/flink-docs-release-0.10/apis/streaming_guide.html#example-program

Example working with states - https://github.com/dataArtisans/flink-training-exercises/blob/master/src/main/java/com/dataartisans/flinktraining/exercises/datastream_java/ride_speed/RideSpeed.java

What would be the reasons for choosing one approach over the other? Can I infer that if the data arrives very irregularly (50% arrives within the defined window length and the other 50% does not), the result of the window approach is more biased (because the late 50% of the events are dropped)?

On the other hand, do we spend more time checking and updating the state when working with state?

I would recommend reading this blog post about Flink's windows. - Fabian Hueske
This article is very clear about working with windows, but comparing windows with state is still confusing to me. How can state have the same effect as an evictor in windows? Why doesn't it require an interval to be defined for an unbounded stream? The example working with state also follows the concept of event time; what would be the differences compared to event time with windows? - HungUnicorn
The example for working with state assumes that each key appears only twice in the stream (a start ride and an end ride event). It also does not compute a "true" average aggregate as the name suggests. Instead, it computes the average speed by dividing distance by time (and not by averaging several speed values). Hence, this example is not very generic. If you need a proper aggregate (sum, avg, count), you should go with a window. - Fabian Hueske

1 Answer


First, it depends on your semantics. The two examples use different semantics and are thus not directly comparable. Furthermore, windows work with state internally, too. It is hard to say in general which approach is the better one.

As Flink's window semantics are very rich, I would suggest using windows. If you cannot express your semantics with windows, using state can be a good alternative. Using windows has the additional advantage that state handling, which is hard to get right, is done automatically for you.
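To give a rough idea of what "handling state yourself" involves, here is a minimal sketch in plain Java. It is not Flink API: the class `ManualStateSketch` and its methods are hypothetical names chosen for illustration, and a plain `HashMap` stands in for state that, in a real job, would have to be managed by Flink so that it is checkpointed.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; not part of Flink.
public class ManualStateSketch {
    // Per-key state the application has to create, update, and clear itself.
    // Each entry holds {sum, count} for one key.
    private final Map<String, long[]> state = new HashMap<>();

    // Update the running aggregate for a key and return the current average.
    public double update(String key, long value) {
        long[] acc = state.computeIfAbsent(key, k -> new long[2]);
        acc[0] += value; // running sum
        acc[1] += 1;     // running count
        return (double) acc[0] / acc[1];
    }

    // Without an explicit clear, an entry is kept for every key ever seen.
    public void clear(String key) {
        state.remove(key);
    }

    public int size() {
        return state.size();
    }
}
```

With a window, the decisions about when to emit a result and when to discard the accumulated values are part of the window definition; in the sketch above, both are left to the caller.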

The decision is definitely independent of your data arrival rate. Flink does not drop any data. If you work with event time (rather than processing time), your result will be the same regardless of the data arrival rate.
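The event-time point can be illustrated with a small plain-Java sketch. It does not use Flink; it only mimics how a tumbling event-time window assigns each event to a window based on the event's own timestamp. The class and method names are hypothetical, events are represented as `{timestampMs, value}` pairs, and watermarks are left out entirely.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only; not part of Flink.
public class EventTimeWindowSketch {

    // Each event is {timestampMs, value}. The result maps the start of each
    // tumbling window to {sum, count} of the values that fall into it.
    public static Map<Long, long[]> sumAndCount(List<long[]> events, long windowSizeMs) {
        Map<Long, long[]> windows = new TreeMap<>();
        for (long[] event : events) {
            long timestampMs = event[0];
            long value = event[1];
            // The window is chosen from the event's timestamp,
            // not from the moment the event happens to arrive.
            long windowStart = timestampMs - (timestampMs % windowSizeMs);
            long[] acc = windows.computeIfAbsent(windowStart, k -> new long[2]);
            acc[0] += value;
            acc[1] += 1;
        }
        return windows;
    }

    // Average of one window's {sum, count} accumulator.
    public static double average(long[] acc) {
        return (double) acc[0] / acc[1];
    }
}
```

Feeding the same events in a different arrival order produces the same per-window sums and averages, because the arrival order never enters the computation.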