0
votes

When does an RDD get it's preferred location? How is the preferred location determined? I've seen some weird behaviors in repartition and coalesce I could not quite make sense of: 1. When coalescing form n to n-1 partitions, I see spark just coalesce one partition to another single partition. (I think the ideal behavior would be evenly distribute to all n-1 nodes)

  1. When run repartition I see spark repartition such that one node have multiple partition of rdds.

Does the above behavior have something to do with preferedLocations?

1

1 Answers

0
votes

Note that rdd.repartition(n) just calls rdd.coalesce(n, shuffle = true), so we're just comparing shuffle true vs false.

shuffle = false

In this mode, Spark constructs a new RDD whose partitions contain one or more partitions of the parent RDD -- if you coalesce from n partitions -> n/2 partitions, then each partition consists of the elements from two semi-random partitions in the parent. This mode is appropriate when you want to reduce partitioning and the partitions are already balanced, like when you've done a filter that affects elements in each partition roughly equally. The overhead is very low. Also, note that it's impossible to increase number of partitions with this mode.

shuffle = true

For some background, I recommend this blog post for learning a bit more about how and why we shuffle. The fundamental differences in this execution mode are:

  • higher overhead (all data is transmitted over network)
  • good for rebalancing partitions (if you perform a filter that drops out either all elements in a partition or none, then shuffle=false will produce imbalanced partitions, but shuffle=true will resolve the issue)
  • can increase the number of partitions

Preferred locations don't have much to do with it -- you're seeing preferred locations only in the shuffle = false mode because the locality is preserved without shuffles, but after a shuffle the original preferredLocations are irrelevant (replaced with new preferred locations about shuffle destinations).