2
votes

I've read in a book that

Flink maintains one state instance per keyvalue and partitions all records with the same key to the

operator task that maintains the state for this key.

my question is:

lets say i have 4 tasks with 2 slots each. and there's a key that belongs to 95% of the data.

does it means that 95% the data is routed to the same machine?

2

2 Answers

0
votes

Yes, it does mean that. If you have a hot key, then partitioning by key doesn't scale well.

In some cases, there are ways to work around this limitation. For example, if you are computing analytics (e.g., you want to count page views per page per minute, and one page gets 95% of the page views), you can do pre-aggregation -- split the work for the hot key across several parallel instances, and then do one final, non-parallel reduction of the partial results. (This is just standard map/reduce logic.)

0
votes

This is called "data skew" and it is the bane of scalable applications everywhere.

It's also possible that the entire (100%) load goes to the same machine. There's no guarantee that the data is spread as evenly as possible by key, only that each key gets processed on a single machine. Technically, each key gets mapped to a key group (the number of key groups is the max parallelism for the topology) and each key group gets handled by a specific instance of an operator.

One way to handle this situation involves adding a second field to the key, resulting in a greater number of possible keys and possibly reducing the data skew across the keys. Then aggregate the results in a subsequent operator using just the one original key.