I am working on my bachelor's final project, which is a comparison between Apache Spark Streaming and Apache Flink (streaming only), and I have just reached "Physical partitioning" in Flink's documentation. The problem is that the documentation does not explain well how these two transformations work. Directly from the documentation:
shuffle(): Partitions elements randomly according to a uniform distribution.
rebalance(): Partitions elements round-robin, creating equal load per partition. Useful for performance optimisation in the presence of data skew.
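For context, this is roughly how the two operators appear in a pipeline. The sketch below uses Flink's Java DataStream API; the source, parallelism, and print sinks are placeholders I chose for illustration (fromSequence assumes a reasonably recent Flink release).

```java
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class PartitioningSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(4);

        // Placeholder source: 1000 sequential longs.
        DataStream<Long> source = env.fromSequence(0, 999);

        // shuffle(): every element goes to a randomly chosen downstream subtask,
        // so subtask loads are only equal in expectation.
        source.shuffle().print("shuffled");

        // rebalance(): elements are handed out round-robin, so every downstream
        // subtask receives (almost exactly) the same number of elements.
        source.rebalance().print("rebalanced");

        env.execute("shuffle vs rebalance sketch");
    }
}
```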
Both are done automatically, so my understanding is that they both redistribute the data evenly (shuffle() according to a uniform random distribution, rebalance() in round-robin fashion). From that I deduce that rebalance() distributes the data more evenly ("equal load per partition"), so every task has to process the same amount of data, whereas shuffle() may create bigger and smaller partitions. In which cases, then, might you prefer shuffle() over rebalance()?
The only thing that comes to my mind is that rebalance() probably requires some processing time, so in some cases the rebalancing itself might cost more time than it saves in the downstream transformations.
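To make the difference concrete, here is a small self-contained simulation (plain Java, not Flink code) of the two assignment strategies as I understand them: random channel selection versus round-robin. The counts it prints show why the random strategy can end up with unevenly loaded partitions while round-robin stays balanced; the element count and parallelism are numbers I picked for illustration.

```java
import java.util.Arrays;
import java.util.Random;

public class AssignmentSimulation {
    public static void main(String[] args) {
        int parallelism = 4;   // number of downstream subtasks (assumed)
        int elements = 1_000;  // number of records to distribute (assumed)

        int[] shuffleLoad = new int[parallelism];
        int[] rebalanceLoad = new int[parallelism];

        Random random = new Random();
        int nextChannel = 0;

        for (int i = 0; i < elements; i++) {
            // "shuffle"-style: pick a channel uniformly at random.
            shuffleLoad[random.nextInt(parallelism)]++;

            // "rebalance"-style: cycle through the channels round-robin.
            rebalanceLoad[nextChannel]++;
            nextChannel = (nextChannel + 1) % parallelism;
        }

        System.out.println("random (shuffle-like):        " + Arrays.toString(shuffleLoad));
        System.out.println("round-robin (rebalance-like): " + Arrays.toString(rebalanceLoad));
    }
}
```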
I have been looking for information on this and nobody seems to have discussed it, except in a Flink mailing list thread, but there they don't explain how shuffle() works either.
Thanks to Sneftel, who helped me improve my question by asking things that made me rethink what I wanted to ask, and to Till, who answered my question quite well. :D
Comments on the question:

- "[…] rebalance() was random?" - Sneftel
- "[…] rebalance() is more efficient? They're just different approaches. shuffle() is a randomized approach to load-balancing, rebalance() is an explicit greedy one." - Sneftel
- "shuffle() distributes the elements in a random and uniform way, so it may not create equally loaded partitions, while rebalance() tries to create all the partitions with the same load. From that I deduce that rebalance() does the same job but in a way that is better for work distribution, since all the TaskManagers will have approximately the same amount of data to process. Then, if rebalance() does the same job better, why would anyone use shuffle()? May the processing needed by rebalance() produce more latency than what it saves in some cases? Thanks :)" - froblesmartin