I am implementing a range query on an RDD of (x, y) points in PySpark. I partitioned the xy space into a 16×16 grid (256 cells) and assigned each point in my RDD to one of these cells.
The resulting gridMappedRDD is a pair RDD of (cell_id, Point) tuples.
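For context, the assignment step looks roughly like this (pointsRDD, the grid bounds, and the Point class are placeholders from my setup):

GRID_DIM = 16

def cell_id(p, min_x, max_x, min_y, max_y):
    # Map a point to a column/row index in the 16x16 grid.
    col = int((p.x - min_x) / (max_x - min_x) * GRID_DIM)
    row = int((p.y - min_y) / (max_y - min_y) * GRID_DIM)
    # Clamp points lying exactly on the upper bounds into the last cell.
    return min(row, GRID_DIM - 1) * GRID_DIM + min(col, GRID_DIM - 1)

gridMappedRDD = pointsRDD.map(lambda p: (cell_id(p, min_x, max_x, min_y, max_y), p))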
I partitioned this RDD into 256 partitions, using:
gridMappedRDD = gridMappedRDD.partitionBy(256)
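To make the mapping from cell to partition predictable (the question below assumes cell i lives in partition i), the partition function can be made the identity instead of the default hash; a minimal sketch, assuming cell ids run from 0 to 255:

gridMappedRDD = gridMappedRDD.partitionBy(256, lambda cid: cid)  # cell i -> partition i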
The range query is a rectangular box. My Grid object has a method that returns the list of cell ids overlapping the query range, so I used this as a filter to prune the unrelated cells:
filteredRDD = gridMappedRDD.filter(lambda x: x[0] in candidateCells)
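Here candidateCells comes from the Grid method mentioned above (the method name below is a placeholder); I keep it as a set so the membership test inside the filter is O(1):

candidateCells = set(grid.getOverlappingCells(queryBox))  # hypothetical name for the Grid method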
But the problem is that when I run the query and collect the results, all 256 partitions are evaluated: a task is created for each partition.
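This is expected in itself, since filter is a narrow transformation and keeps the parent's partitioning:

print(filteredRDD.getNumPartitions())  # still 256; empty partitions are not dropped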
To avoid this problem, I tried coalescing filteredRDD down to the length of the candidateCells list, hoping that would help:
filteredRDD = filteredRDD.coalesce(len(candidateCells))
In fact, the resulting RDD has len(candidateCells) partitions, but the partitions are not the same as those of gridMappedRDD.
As stated in the coalesce documentation, the shuffle parameter defaults to False, so no shuffle should be performed among partitions, but I can see (with the help of glom()) that this is not the case.
For example, after a coalesce(4) with candidateCells = [62, 63, 78, 79], the partitions look like this:
[[(62, P), (62, P), ..., (63, P)],
 [(78, P), (78, P), ..., (79, P)],
 [], []]
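For reference, this is how I inspected the partition contents:

for part in filteredRDD.coalesce(4).glom().collect():
    print(part)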
Worse, by coalescing I get a shuffle read equal to the size of my whole dataset for every task, which takes significant time. What I need is an RDD containing only the partitions for the cells in candidateCells, without any shuffle.

So, my question is: is it possible to filter only some partitions without reshuffling? For the example above, my filteredRDD would have 4 partitions holding exactly the same data as gridMappedRDD's partitions 62, 63, 78, and 79. That way, the query would be directed only at the affected partitions.
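To make the desired behaviour concrete, the closest I can express it myself is with mapPartitionsWithIndex (assuming the identity partitioner above, so cell i lives in partition i); but as far as I can tell this still schedules a task for every partition, so it does not actually fix the scheduling problem:

wanted = set(candidateCells)  # partition indices to keep, given cell i -> partition i
prunedRDD = gridMappedRDD.mapPartitionsWithIndex(
    lambda idx, it: it if idx in wanted else iter([]),
    preservesPartitioning=True)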