Local aggregation for data stream in Flink

Question

I'm trying to find a good way to combine Flink keyed WindowedStream locally for Flink application. The idea is to similar to a combiner in MapReduce: to combine partial results in each partition (or mapper) before the data (which is still a keyed WindowedStream) is sent to a global aggregator (or reducer). The closest function I found is: aggregate but I was't be able to find a good example for the usage on WindowedStream.

It looks like aggregate doesn't allow a WindowedStream output. Is there any other way to solve this?

There's an open "FLIP" to address this need: cwiki.apache.org/confluence/display/FLINK/… — Yann

Felipe Felipe · Accepted Answer · 2020-03-20T09:23:50

There have been some initiatives to provide pre-aggregation in Flink. You have to implement your own operator. In the case of stream environment you have to extend the class AbstractStreamOperator. KurtYoung implemented a BundleOperator. You can also use the Table API on top of the stream API. The Table API is already providing a local aggregation. I also have one example of the pre-aggregate operator that I implemented myself. Usually, the drawback of all those solutions is that you have to set the number of items to pre-aggregate or the timeout to pre-aggregate. If you don't have it you can run out of memory, or you never shuffle items (if the threshold number of items is not achieved). In other words, they are rule-based. What I would like to have is something that is cost-based, more dynamic. I would like to have something that adjusts those parameters in run-time.

I hope these links can help you. And, if you have ideas for the cost-based solution, please come to talk with me =).

Local aggregation for data stream in Flink

1 Answers