How does Flink partition data across state?

Question

I've read in a book that:

"Flink maintains one state instance per key value and partitions all records with the same key to the operator task that maintains the state for this key."

My question is:

Let's say I have 4 tasks with 2 slots each, and there is one key that accounts for 95% of the data.

Does that mean that 95% of the data is routed to the same machine?

David Anderson · Accepted Answer · 2019-10-29T08:21:02

Yes, it does mean that. If you have a hot key, then partitioning by key doesn't scale well.
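To make the skew concrete, here is a minimal simulation in plain Python (not Flink code). It assumes the setup in the question gives 8 parallel subtasks, and it uses a hypothetical `subtask_for` helper that picks a subtask with a stable hash modulo the parallelism. Flink's real key-to-subtask assignment is more involved, but the property that matters is the same: equal keys always go to the same subtask.

```python
import zlib
from collections import Counter

PARALLELISM = 8  # assumption: 4 x 2 slots from the question


def subtask_for(key: str, parallelism: int = PARALLELISM) -> int:
    """Simplified stand-in for key partitioning: stable hash modulo parallelism."""
    return zlib.crc32(key.encode("utf-8")) % parallelism


def load_per_subtask(keys, parallelism: int = PARALLELISM) -> Counter:
    """Count how many records each subtask would receive."""
    load = Counter()
    for key in keys:
        load[subtask_for(key, parallelism)] += 1
    return load


# 100 records: 95 for one hot key, 5 spread over other keys.
records = ["hot"] * 95 + [f"page-{i}" for i in range(5)]
load = load_per_subtask(records)
hot_subtask = subtask_for("hot")

# The subtask that owns "hot" receives at least 95 of the 100 records,
# no matter how many other subtasks exist.
print(f"subtask {hot_subtask} receives {load[hot_subtask]} of {len(records)} records")
```

Raising `PARALLELISM` does not help here, because every record with the hot key still hashes to a single subtask.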

In some cases, there are ways to work around this limitation. For example, if you are computing analytics (e.g., you want to count page views per page per minute, and one page gets 95% of the page views), you can do pre-aggregation -- split the work for the hot key across several parallel instances, and then do one final, non-parallel reduction of the partial results. (This is just standard map/reduce logic.)
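As a rough sketch of that pre-aggregation idea, the following plain Python (again, not Flink code) splits each page's count across several "salted" sub-keys and then merges the partial counts. The names `SALT_BUCKETS`, `pre_aggregate`, and `final_reduce` are made up for illustration, the salt is assigned round-robin to keep the example deterministic, and the per-minute windowing is left out.

```python
from collections import Counter

SALT_BUCKETS = 8  # hypothetical: number of sub-keys each page is split into


def pre_aggregate(page_views, salt_buckets: int = SALT_BUCKETS) -> Counter:
    """Stage 1: count per (page, salt).

    Each (page, salt) pair acts as its own key, so the work for a hot page
    can be spread over up to `salt_buckets` parallel instances.
    """
    partials = Counter()
    for i, page in enumerate(page_views):
        salt = i % salt_buckets  # round-robin salt; a random salt would also work
        partials[(page, salt)] += 1
    return partials


def final_reduce(partials: Counter) -> Counter:
    """Stage 2: merge the partial counts per page.

    This step sees at most `salt_buckets` inputs per page instead of every
    raw record, so it is cheap even though it is not parallel per page.
    """
    totals = Counter()
    for (page, _salt), count in partials.items():
        totals[page] += count
    return totals


# One page receives 95% of the page views.
page_views = ["home"] * 950 + ["about"] * 30 + ["contact"] * 20
partials = pre_aggregate(page_views)
totals = final_reduce(partials)

largest_home_partial = max(c for (page, _), c in partials.items() if page == "home")
print(totals["home"], largest_home_partial)
```

This only works when the aggregation can be merged from partial results (counts, sums, min/max). In a Flink job the same shape would be two keyed aggregations, the first keyed by page plus salt and the second by page alone.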
