co-located vs. co-partitioned RDDs

Question

I'm quite new to Spark and I have two questions:

I have a large set of points. I made an RDD (called partitionedData) from them and partitioned it with a custom partitioner so that each partition holds at most a threshold number of points. I need to choose some points as leaders in each partition, and I need to be sure that the leaders and the points of each partition are on the same node, so I call mapPartitions on partitionedData with the preservesPartitioning flag set to true. The result of this is my desired leader RDD. Here is my first question: I know that the leader RDD preserves its parent RDD's partitioning (co-partitioned), but I'm not sure whether the leaders in each partition will be placed on the same node as their parent points (co-located).
If the answer to the above question is no, how can I co-locate the partitions of a given RDD with those of another pre-partitioned RDD?

Javier Bañez · Accepted Answer · 2018-02-19T15:38:54

For the RDDs to be co-located, so that you can guarantee there is no shuffling, all of the co-partitioning has to be done within the same action.

If you have intermediate actions, the integer index that the custom partitioner creates could be assigned to different nodes, and in that case a shuffle would be required.
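
To make the distinction concrete, here is a toy model in plain Python. It is not Spark code and it does not model Spark's scheduler; `custom_partitioner`, `pick_leaders`, the node names and both placement tables are made up for illustration. It only shows that the partition index is a function of the key (so parent and child agree on it), while the node that holds a given index is a separate assignment that may differ between the two.

```python
# Toy model only: plain Python, not Spark, and not a model of Spark's scheduler.
from collections import defaultdict

NUM_PARTITIONS = 3


def custom_partitioner(key):
    """Hypothetical partitioner: maps a key to an integer partition index."""
    return key % NUM_PARTITIONS


def partition(records):
    """Group (key, value) records by the index the partitioner assigns."""
    parts = defaultdict(list)
    for key, value in records:
        parts[custom_partitioner(key)].append((key, value))
    return dict(parts)


def pick_leaders(parts):
    """Per-partition step that keeps each partition's index unchanged."""
    return {idx: [min(rows)] for idx, rows in parts.items()}


points = [(k, f"p{k}") for k in range(9)]
point_parts = partition(points)
leader_parts = pick_leaders(point_parts)

# Co-partitioned: every leader sits under the same index as its points.
co_partitioned = all(leader_parts[i][0] in point_parts[i] for i in point_parts)

# Placement is a separate mapping from index to node. These two tables are
# invented to show that equal indices do not by themselves imply equal nodes.
placement_points = {0: "node-a", 1: "node-b", 2: "node-c"}
placement_leaders = {0: "node-b", 1: "node-b", 2: "node-a"}

# Indices whose points and leaders happen to land on the same node.
co_located = [i for i in point_parts if placement_points[i] == placement_leaders[i]]
print(co_partitioned, co_located)
```

In this sketch the leaders are always co-partitioned with their points, but only the indices where the two placement tables agree are co-located; everything else would need data to move between nodes.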
