Note that rdd.repartition(n) just calls rdd.coalesce(n, shuffle = true), so we're really comparing shuffle = true vs. shuffle = false.
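That's literally the definition in Spark's source; a simplified sketch of what `RDD.scala` does (the real method also takes an implicit `Ordering` parameter, but the body is just this delegation):

```scala
// Simplified from org.apache.spark.rdd.RDD:
def repartition(numPartitions: Int): RDD[T] =
  coalesce(numPartitions, shuffle = true)
```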
shuffle = false
In this mode, Spark constructs a new RDD whose partitions each contain one or more whole partitions of the parent RDD -- if you coalesce from n partitions down to n/2, each new partition consists of the elements of two (more or less arbitrarily grouped) parent partitions. This mode is appropriate when you want to reduce the partition count and the partitions are already balanced, e.g. after a filter that removed elements from each partition roughly equally. The overhead is very low. Note also that it's impossible to increase the number of partitions in this mode (see the sketch below).
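Here's a minimal sketch you can paste into spark-shell (where `sc` is the provided SparkContext) to see both behaviors; the variable names are mine:

```scala
val rdd = sc.parallelize(1 to 1000, 8)              // 8 balanced partitions

// Merging down works with a narrow (no-shuffle) dependency:
println(rdd.coalesce(4).getNumPartitions)           // 4

// Growing without a shuffle is silently ignored -- you keep the old count:
println(rdd.coalesce(16).getNumPartitions)          // still 8

// Growing the partition count requires shuffle = true:
println(rdd.coalesce(16, shuffle = true).getNumPartitions) // 16
```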
shuffle = true
For some background, I recommend this blog post to learn a bit more about how and why we shuffle. The fundamental differences in this execution mode are:
- higher overhead (all data must be transmitted over the network)
- good for rebalancing partitions (if you perform a filter that drops all elements in some partitions and none in others, shuffle = false will produce imbalanced partitions, but shuffle = true resolves the issue -- see the sketch after this list)
- can increase the number of partitions
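To make the rebalancing point concrete, here's a sketch (again for spark-shell; the exact counts in the no-shuffle case depend on how coalesce groups parent partitions, so treat the outputs as illustrative):

```scala
// 10 partitions of 100 elements each; the filter keeps only the elements
// that all live in the first partition, leaving the other nine empty:
val skewed = sc.parallelize(1 to 1000, 10).filter(_ <= 100)

// shuffle = false just glues whole parent partitions together,
// so the skew survives:
skewed.coalesce(5)
  .mapPartitions(it => Iterator(it.size)).collect()
// e.g. Array(100, 0, 0, 0, 0)

// shuffle = true redistributes individual elements, balancing the result:
skewed.coalesce(5, shuffle = true)
  .mapPartitions(it => Iterator(it.size)).collect()
// e.g. Array(20, 20, 20, 20, 20)
```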
Preferred locations don't have much to do with it -- you see preferred locations only in the shuffle = false mode because locality is preserved when there's no shuffle; after a shuffle, the original preferredLocations are irrelevant and are replaced by new preferred locations that reflect where the shuffle outputs land.
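If you want to check this yourself, you can inspect `preferredLocations` directly. This is only meaningful on a real cluster (in `local` mode the lists are typically empty), and the input path below is hypothetical:

```scala
val parent = sc.textFile("hdfs:///some/large/file")   // hypothetical path

val narrow   = parent.coalesce(2)                  // no shuffle: locality kept
val shuffled = parent.coalesce(2, shuffle = true)  // shuffle: locality replaced

// Typically the block locations of the merged parent partitions:
narrow.partitions.foreach(p => println(narrow.preferredLocations(p)))

// Derived from where the shuffle outputs land, not the original blocks:
shuffled.partitions.foreach(p => println(shuffled.preferredLocations(p)))
```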