2
votes

I'm evaluating Flink specifically for the streaming window support for possible alert generation. My concern is the memory usage so if someone could help with this it would be appreciated.

For example, this application will be consuming potentially a significant amount of data from the stream within a given tumbling window of say 5 minutes. At the point of evaluation, if there were say a million documents for example that matched the criteria, would they all be loaded into memory?

The general flow would be:

producer -> kafka -> flinkkafkaconsumer -> table.window(Tumble.over("5.minutes").select("...").where("...").writeToSink(someKafkaSink)

Additionally, if there is some clear documentation that describes how memory is being dealt with in these cases that I may have overlooked that someone could out that would be helpful.

Thanks

1

1 Answers

4
votes

The amount of data that is stored for a group window aggregation depends on the type of the aggregation. Many aggregation functions such as COUNT, SUM, and MIN/MAX can be preaggregated, i.e., they only need to store a single value per window. Other aggregation functions, such as MEDIAN or certain user-defined aggregation functions, need to store all values before they can compute their result.

The data that needs to be stored for an aggregation is stored in a state backend. Depending on the choice of the state backend, the data might be stored in-memory on the JVM heap or on disk in a RocksDB instance.

Table API queries are also optimized by a relational optimizer (based on Apache Calcite) such that filters are pushed as far towards the sources as possible. Depending on the predicate, the filter might be applied before the aggregation.

Finally, you need to add a groupBy() between window() and select() in your example query (see the examples in the docs).