
We plan to use Apache Flink to perform real-time aggregations on multiple types of objects. We need to support several types of aggregations such as sum, max, min, and average - nothing special so far.

Our requirement is to output the data to Kafka, where one message contains multiple aggregated values for multiple object attributes.

For example, a message should include the sum, max, and average of attribute A, as well as the sum and min of attribute B, for the last 10 minutes.

My question is: what is the best way to implement such a requirement with Flink?

We thought about using a custom window function that runs over all the objects at the end of the window, computes all of the required values itself, and outputs a new object holding all of these aggregated values. Our concern with this solution is the effect on memory consumption of having to hold all the window data in memory until the window fires (we will have many such windows open at the same time).
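For concreteness, a minimal sketch of that buffering approach is shown below, using hypothetical ObjectEvent and AggregatedRecord POJOs (the names, fields, and String key are made up for illustration, and each class would live in its own file). Because a plain ProcessWindowFunction receives an Iterable over every element of the window, Flink has to keep all of those elements in state until the window fires, which is exactly the memory concern described above.

```java
import org.apache.flink.streaming.api.functions.windowing.ProcessWindowFunction;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;

// Hypothetical input POJO, for illustration only.
public class ObjectEvent {
    public String objectId;
    public double attributeA;
    public double attributeB;
}

// Hypothetical output POJO holding all aggregated values for one key and window.
public class AggregatedRecord {
    public String objectId;
    public double sumA, maxA, avgA;
    public double sumB, minB;
}

// Buffering approach: the ProcessWindowFunction sees every element of the window,
// so Flink must hold all of them in state until the window fires.
public class BufferingWindowFunction
        extends ProcessWindowFunction<ObjectEvent, AggregatedRecord, String, TimeWindow> {

    @Override
    public void process(String objectId, Context ctx,
                        Iterable<ObjectEvent> events, Collector<AggregatedRecord> out) {
        long count = 0;
        double sumA = 0, maxA = Double.NEGATIVE_INFINITY;
        double sumB = 0, minB = Double.POSITIVE_INFINITY;
        for (ObjectEvent e : events) {          // iterates the fully buffered window
            count++;
            sumA += e.attributeA;
            maxA = Math.max(maxA, e.attributeA);
            sumB += e.attributeB;
            minB = Math.min(minB, e.attributeB);
        }
        AggregatedRecord r = new AggregatedRecord();
        r.objectId = objectId;
        r.sumA = sumA;
        r.maxA = maxA;
        r.avgA = count == 0 ? 0 : sumA / count;
        r.sumB = sumB;
        r.minB = minB;
        out.collect(r);
    }
}
```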

Any suggestions / comments are highly appreciated!

Thanks


1 Answer


The best approach would be to use incremental aggregation to compute the count, sum, min, and max for each window -- and you can compute the average in your window function, given the sum and count. In this way the only state you'll need to keep is these four values (the count, sum, min, and max), rather than having to buffer the entire stream for processing at the end of the window.
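A minimal sketch of what that could look like, reusing the hypothetical ObjectEvent and AggregatedRecord POJOs from the question's sketch (for brevity the accumulator below only covers attribute A; attribute B just needs additional sum/min fields maintained the same way):

```java
import org.apache.flink.api.common.functions.AggregateFunction;
import org.apache.flink.streaming.api.functions.windowing.ProcessWindowFunction;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;

// Per-window state: just four numbers, no matter how many events arrive.
public class Stats {
    public long count;
    public double sum;
    public double min = Double.POSITIVE_INFINITY;
    public double max = Double.NEGATIVE_INFINITY;
}

// Incrementally folds each incoming event into the accumulator as it arrives.
public class StatsAggregate implements AggregateFunction<ObjectEvent, Stats, Stats> {

    @Override
    public Stats createAccumulator() {
        return new Stats();
    }

    @Override
    public Stats add(ObjectEvent event, Stats acc) {
        acc.count++;
        acc.sum += event.attributeA;
        acc.min = Math.min(acc.min, event.attributeA);
        acc.max = Math.max(acc.max, event.attributeA);
        return acc;
    }

    @Override
    public Stats getResult(Stats acc) {
        return acc;
    }

    @Override
    public Stats merge(Stats a, Stats b) {
        Stats m = new Stats();
        m.count = a.count + b.count;
        m.sum = a.sum + b.sum;
        m.min = Math.min(a.min, b.min);
        m.max = Math.max(a.max, b.max);
        return m;
    }
}

// Runs once per key and window, receiving only the pre-aggregated Stats
// (not the raw events), and derives the average from sum and count.
public class StatsWindowFunction
        extends ProcessWindowFunction<Stats, AggregatedRecord, String, TimeWindow> {

    @Override
    public void process(String objectId, Context ctx,
                        Iterable<Stats> stats, Collector<AggregatedRecord> out) {
        Stats s = stats.iterator().next();   // exactly one pre-aggregated value
        AggregatedRecord r = new AggregatedRecord();
        r.objectId = objectId;
        r.sumA = s.sum;
        r.maxA = s.max;
        r.avgA = s.count == 0 ? 0 : s.sum / s.count;
        out.collect(r);
    }
}
```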

The incremental window aggregation example in the Flink windowing documentation should be enough to get you started.
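Wiring it together might look roughly like this (the 10-minute tumbling event-time window and the stream/class names are assumptions based on the question, not a definitive setup):

```java
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

// 'events' is an existing DataStream<ObjectEvent> with timestamps/watermarks assigned.
DataStream<AggregatedRecord> aggregated = events
        .keyBy(e -> e.objectId)
        .window(TumblingEventTimeWindows.of(Time.minutes(10)))
        // incremental pre-aggregation + final window function for the average
        .aggregate(new StatsAggregate(), new StatsWindowFunction());

// The AggregatedRecord stream can then be serialized (e.g. to JSON) and written
// to Kafka with the Kafka sink, producing one message per key and window.
```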