4
votes

I need to do a sliding count over a large volume of messages using keyed state and Flink's TimerService. The sliding size is one and the window size is larger than 10 hours. The problem I've run into is that checkpointing takes a long time. To improve performance we enabled incremental checkpoints, but checkpoints are still slow. We found that most of the time is spent serializing the timers that are used to clean up old data. We have a timer for each key, and there are about 300M timers in all.

Any suggestion for solving this problem would be appreciated. Or could we do the count another way?

————————————————————————————————————————————

I'd like to add some details to the situation. The sliding size is one event and the window size is more than 10 hours (there are about 300 events per second), and we need to react to each event. For that reason we did not use the windows provided by Flink; we use keyed state to store the previous information instead. The timers are used in a ProcessFunction to trigger the cleanup of old data. Finally, the number of distinct keys is very large.
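To make this concrete, here is a simplified sketch of what we do per key (the Event type, the state name, and the details of the cleanup delay are illustrative, not our real code):

new ProcessFunction<Event, Long>() {

  private transient ValueState<Long> countState;

  @Override
  public void open(Configuration parameters) {
    countState = getRuntimeContext().getState(
        new ValueStateDescriptor<>("count", Long.class));
  }

  @Override
  public void processElement(Event event, Context ctx, Collector<Long> out)
      throws Exception {
    Long count = countState.value();
    countState.update(count == null ? 1L : count + 1);
    out.collect(countState.value());
    // one cleanup timer per key (in the real job we track and replace it,
    // so each key keeps exactly one) -> ~300M timers across all keys
    ctx.timerService().registerProcessingTimeTimer(
        ctx.timerService().currentProcessingTime() + 10 * 60 * 60 * 1000L);
  }

  @Override
  public void onTimer(long timestamp, OnTimerContext ctx, Collector<Long> out) {
    countState.clear();
  }
}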

Could you provide a more detailed description? I tried to answer you, but it's difficult without more details. – diegoreico
Please clarify the situation. The sliding size is "one" what? One hour, one minute, or one event? Into how many different windows is each event being assigned? How does the windowing relate to the timers in question (are you talking about the timers Flink uses for timeWindow, or something in a ProcessFunction)? Are there actually 300M distinct keys? – David Anderson
Thanks for your attention. I added some details to the situation. I hope that clarifies the question. – Barry Bai

3 Answers

3
votes

I think this should work:

Dramatically reduce the number of keys Flink is working with from 300M down to 100K (for example), by effectively doing something like keyBy(key mod 100000). Your ProcessFunction can then use a MapState (where the keys are the original keys) to store whatever it needs.

MapStates have iterators, which you can use to periodically crawl each of these maps to expire old items. Stick to the principle of having only one timer per key (per uberkey, if you will), so that you only have 100K timers.
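A rough sketch of this idea (untested; the Event and Result types, the getKey() accessor, and the CLEANUP_INTERVAL_MS / WINDOW_MS constants are placeholders):

// CLEANUP_INTERVAL_MS and WINDOW_MS are assumed constants
events
    .keyBy(e -> Math.floorMod(e.getKey().hashCode(), 100_000))  // ~100K "uberkeys"
    .process(new ProcessFunction<Event, Result>() {

      // original key -> timestamp of its most recent update
      private transient MapState<String, Long> lastSeen;

      @Override
      public void open(Configuration parameters) {
        lastSeen = getRuntimeContext().getMapState(
            new MapStateDescriptor<>("lastSeen", String.class, Long.class));
      }

      @Override
      public void processElement(Event e, Context ctx, Collector<Result> out)
          throws Exception {
        long now = ctx.timerService().currentProcessingTime();
        if (!lastSeen.iterator().hasNext()) {
          // first entry for this uberkey: register its single cleanup timer
          ctx.timerService().registerProcessingTimeTimer(now + CLEANUP_INTERVAL_MS);
        }
        lastSeen.put(e.getKey(), now);
        // ... per-original-key counting logic goes here ...
      }

      @Override
      public void onTimer(long ts, OnTimerContext ctx, Collector<Result> out)
          throws Exception {
        // crawl the map and expire entries older than the window
        Iterator<Map.Entry<String, Long>> it = lastSeen.iterator();
        while (it.hasNext()) {
          if (ts - it.next().getValue() > WINDOW_MS) {
            it.remove();
          }
        }
        // refresh the timer only while this uberkey still holds data
        if (lastSeen.iterator().hasNext()) {
          ctx.timerService().registerProcessingTimeTimer(ts + CLEANUP_INTERVAL_MS);
        }
      }
    });

Each uberkey's map then holds on the order of 3,000 original keys (300M / 100K), and the timer state shrinks from ~300M entries to at most 100K.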

UPDATE:

Flink 1.6 included FLINK-9485, which allows timers to be checkpointed asynchronously, and to be stored in RocksDB. This makes it much more practical for Flink applications to have large numbers of timers.
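If I remember the docs correctly, with the RocksDB state backend you opt into this via flink-conf.yaml (double-check the option for your exact version):

# keep timer state in RocksDB rather than on the heap, so it is
# snapshotted asynchronously along with the rest of the keyed state
state.backend.rocksdb.timer-service.factory: ROCKSDB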

0
votes

What if, instead of using timers, you add an extra field to every element of your stream that stores the current processing time or the arrival time? Then, when you want to clean old data out of your stream, you just apply a filter operator that checks whether the data is old enough to be deleted.
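For example (a sketch; the Event type, the stream names, and the 10-hour threshold are placeholders):

// stamp each element with its arrival time on the way in ...
DataStream<Tuple2<Event, Long>> stamped = events
    .map(e -> Tuple2.of(e, System.currentTimeMillis()))
    .returns(Types.TUPLE(Types.GENERIC(Event.class), Types.LONG));

// ... and later drop anything older than the window with a plain filter
DataStream<Tuple2<Event, Long>> fresh = stamped
    .filter(t -> System.currentTimeMillis() - t.f1 <= 10 * 60 * 60 * 1000L);

Note that this prunes elements flowing through the pipeline rather than state already held inside an operator, so it fits best when the old data lives in the stream itself.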

0
votes

Rather than registering a clearing timer on each event, how about registering a timer only once per period, e.g. once per minute? You could register it the first time a key is seen and refresh it in onTimer. Something like:

new ProcessFunction<SongEvent, Object>() {

  // elided fields: a ValueState<Boolean> "state" tracking whether a timer
  // is already registered for the current key, and "elements", the keyed
  // state holding the buffered events
  ...

  @Override
  public void processElement(
      SongEvent songEvent,
      Context context,
      Collector<Object> collector) throws Exception {

    // register a cleanup timer only if this key doesn't have one yet
    Boolean isTimerRegistered = state.value();
    if (isTimerRegistered == null || !isTimerRegistered) {
      long time = context.timerService().currentProcessingTime() + 60_000; // e.g. one minute
      context.timerService().registerProcessingTimeTimer(time);
      state.update(true);
    }

    // Standard processing

  }

  @Override
  public void onTimer(long timestamp, OnTimerContext ctx, Collector<Object> out)
      throws Exception {
    pruneElements(timestamp);

    // refresh the timer while there is still data to expire for this key
    if (!elements.isEmpty()) {
      ctx.timerService().registerProcessingTimeTimer(timestamp + 60_000);
    } else {
      state.clear();
    }
  }
}

Something similar is implemented for the Flink SQL OVER clause. You can have a look here.