Note that rdd.repartition(n) just calls rdd.coalesce(n, shuffle = true), so we're really comparing shuffle = true vs. shuffle = false.
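That's literally the definition in Spark's source; a simplified sketch of what `RDD.scala` does (the real method also takes an implicit `Ordering` parameter, but the body is just this delegation):

```scala
// Simplified from org.apache.spark.rdd.RDD:
def repartition(numPartitions: Int): RDD[T] =
  coalesce(numPartitions, shuffle = true)
```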
shuffle = false
In this mode, Spark constructs a new RDD whose partitions each contain one or more whole partitions of the parent RDD -- if you coalesce from n partitions down to n/2, each new partition consists of the elements of two (more or less arbitrarily grouped) parent partitions. This mode is appropriate when you want to reduce the partition count and the partitions are already balanced, e.g. after a filter that removed elements from each partition roughly equally. The overhead is very low. Note also that it's impossible to increase the number of partitions in this mode (see the sketch below).
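Here's a minimal sketch you can paste into spark-shell (where `sc` is the provided SparkContext) to see both behaviors; the variable names are mine:

```scala
val rdd = sc.parallelize(1 to 1000, 8)              // 8 balanced partitions

// Merging down works with a narrow (no-shuffle) dependency:
println(rdd.coalesce(4).getNumPartitions)           // 4

// Growing without a shuffle is silently ignored -- you keep the old count:
println(rdd.coalesce(16).getNumPartitions)          // still 8

// Growing the partition count requires shuffle = true:
println(rdd.coalesce(16, shuffle = true).getNumPartitions) // 16
```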
shuffle = true
For some background, I recommend this blog post to learn a bit more about how and why we shuffle. The fundamental differences in this execution mode are:
- higher overhead (all data must be transmitted over the network)
- good for rebalancing partitions (if you perform a filter that drops all elements in some partitions and none in others, shuffle = false will produce imbalanced partitions, but shuffle = true resolves the issue -- see the sketch after this list)
- can increase the number of partitions
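To make the rebalancing point concrete, here's a sketch (again for spark-shell; the exact counts in the no-shuffle case depend on how coalesce groups parent partitions, so treat the outputs as illustrative):

```scala
// 10 partitions of 100 elements each; the filter keeps only the elements
// that all live in the first partition, leaving the other nine empty:
val skewed = sc.parallelize(1 to 1000, 10).filter(_ <= 100)

// shuffle = false just glues whole parent partitions together,
// so the skew survives:
skewed.coalesce(5)
  .mapPartitions(it => Iterator(it.size)).collect()
// e.g. Array(100, 0, 0, 0, 0)

// shuffle = true redistributes individual elements, balancing the result:
skewed.coalesce(5, shuffle = true)
  .mapPartitions(it => Iterator(it.size)).collect()
// e.g. Array(20, 20, 20, 20, 20)
```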
Preferred locations don't have much to do with it -- you see preferred locations only in the shuffle = false mode because locality is preserved when there's no shuffle; after a shuffle, the original preferredLocations are irrelevant and are replaced by new preferred locations that reflect where the shuffle outputs land.
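If you want to check this yourself, you can inspect `preferredLocations` directly. This is only meaningful on a real cluster (in `local` mode the lists are typically empty), and the input path below is hypothetical:

```scala
val parent = sc.textFile("hdfs:///some/large/file")   // hypothetical path

val narrow   = parent.coalesce(2)                  // no shuffle: locality kept
val shuffled = parent.coalesce(2, shuffle = true)  // shuffle: locality replaced

// Typically the block locations of the merged parent partitions:
narrow.partitions.foreach(p => println(narrow.preferredLocations(p)))

// Derived from where the shuffle outputs land, not the original blocks:
shuffled.partitions.foreach(p => println(shuffled.preferredLocations(p)))
```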