I am working on my bachelor's final project, which is a comparison between Apache Spark Streaming and Apache Flink (streaming only), and I have just reached "Physical partitioning" in Flink's documentation. The problem is that the documentation does not explain well how these two transformations work. Directly from the documentation:
shuffle(): Partitions elements randomly according to a uniform distribution.
rebalance(): Partitions elements round-robin, creating equal load per partition. Useful for performance optimisation in the presence of data skew.
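For context, this is roughly how the two operators appear in a pipeline. The sketch below uses Flink's Java DataStream API; the source, parallelism, and print sinks are placeholders I chose for illustration (fromSequence assumes a reasonably recent Flink release).

```java
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class PartitioningSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(4);

        // Placeholder source: 1000 sequential longs.
        DataStream<Long> source = env.fromSequence(0, 999);

        // shuffle(): every element goes to a randomly chosen downstream subtask,
        // so subtask loads are only equal in expectation.
        source.shuffle().print("shuffled");

        // rebalance(): elements are handed out round-robin, so every downstream
        // subtask receives (almost exactly) the same number of elements.
        source.rebalance().print("rebalanced");

        env.execute("shuffle vs rebalance sketch");
    }
}
```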
Both are done automatically, so my understanding is that they both redistribute the data evenly (shuffle() according to a uniform random distribution, rebalance() in round-robin fashion). From that I deduce that rebalance() distributes the data more evenly ("equal load per partition"), so every task has to process the same amount of data, whereas shuffle() may create bigger and smaller partitions. In which cases, then, might you prefer shuffle() over rebalance()?
The only thing that comes to my mind is that rebalance() probably requires some processing time, so in some cases the rebalancing itself might cost more time than it saves in the downstream transformations.
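To make the difference concrete, here is a small self-contained simulation (plain Java, not Flink code) of the two assignment strategies as I understand them: random channel selection versus round-robin. The counts it prints show why the random strategy can end up with unevenly loaded partitions while round-robin stays balanced; the element count and parallelism are numbers I picked for illustration.

```java
import java.util.Arrays;
import java.util.Random;

public class AssignmentSimulation {
    public static void main(String[] args) {
        int parallelism = 4;   // number of downstream subtasks (assumed)
        int elements = 1_000;  // number of records to distribute (assumed)

        int[] shuffleLoad = new int[parallelism];
        int[] rebalanceLoad = new int[parallelism];

        Random random = new Random();
        int nextChannel = 0;

        for (int i = 0; i < elements; i++) {
            // "shuffle"-style: pick a channel uniformly at random.
            shuffleLoad[random.nextInt(parallelism)]++;

            // "rebalance"-style: cycle through the channels round-robin.
            rebalanceLoad[nextChannel]++;
            nextChannel = (nextChannel + 1) % parallelism;
        }

        System.out.println("random (shuffle-like):        " + Arrays.toString(shuffleLoad));
        System.out.println("round-robin (rebalance-like): " + Arrays.toString(rebalanceLoad));
    }
}
```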
I have been looking for information on this and nobody seems to have discussed it, except in a Flink mailing list thread, but there they don't explain how shuffle() works either.
Thanks to Sneftel, who helped me improve my question by asking things that made me rethink what I wanted to ask, and to Till, who answered my question quite well. :D
Comments on the question:

- "[…] rebalance() was random?" - Sneftel
- "[…] rebalance() is more efficient? They're just different approaches. shuffle() is a randomized approach to load-balancing, rebalance() is an explicit greedy one." - Sneftel
- "shuffle() distributes the elements in a random and uniform way, so it may not create equally loaded partitions, while rebalance() tries to create all the partitions with the same load. From that I deduce that rebalance() does the same job but in a way that is better for work distribution, since all the TaskManagers will have approximately the same amount of data to process. Then, if rebalance() does the same job better, why would anyone use shuffle()? May the processing needed by rebalance() produce more latency than what it saves in some cases? Thanks :)" - froblesmartin