Spark RDD: Vary the size of each partition

Question

I need executors to finish processing data at different times.

I think the easiest way is to give the RDD partitions non-uniform sizes. How can I do this?

Amit Aviv · Accepted Answer · 2015-12-02T02:48:20

Not sure what you are trying to achieve, but you can partition the RDD any way you like using partitionBy, e.g.:

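# Assumes an existing SparkContext named sc; range also works on Python 3.
(
    sc.parallelize(range(10)).zipWithIndex()      # (value, index) pairs; the value acts as the key
    .partitionBy(2, lambda k: 0 if k < 2 else 1)  # keys 0 and 1 go to partition 0, the rest to 1
    .glom().collect()                             # one list per partition
)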

[[(0, 0), (1, 1)], [(2, 2), (3, 3), (4, 4), (5, 5), (6, 6), (7, 7), (8, 8), (9, 9)]]

Note that partitionBy works on a (k, v) RDD, and the partitioning function receives only the key k as its parameter.
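
If you want to choose the partition sizes explicitly rather than hard-coding a threshold, one option is to build the partitioning function from a list of target sizes. The following is only a sketch under some assumptions: the helper name make_size_partitioner is hypothetical (it is not part of Spark), and it expects the RDD to be keyed by a 0-based element index. It is plain Python, so it can be checked without a cluster.

```python
from bisect import bisect_right
from itertools import accumulate


def make_size_partitioner(sizes):
    """Hypothetical helper: map a 0-based index to a partition id so that
    partition i receives sizes[i] consecutive indices."""
    bounds = list(accumulate(sizes))  # sizes [1, 2, 7] -> bounds [1, 3, 10]
    last = len(sizes) - 1

    def partition_of(index):
        # bisect_right counts the boundaries that are <= index, which is the
        # partition id; indices past the final boundary go to the last partition.
        return min(bisect_right(bounds, index), last)

    return partition_of


part = make_size_partitioner([1, 2, 7])
assignments = [part(i) for i in range(10)]
print(assignments)

# Possible use with Spark (assumes an existing SparkContext named sc):
#   rdd.zipWithIndex().map(lambda vi: (vi[1], vi[0])).partitionBy(3, part)
```

The map step swaps each (value, index) pair so that the index becomes the key, since the partitioning function only ever sees the key. PySpark applies the function's result modulo the number of partitions, so the returned ids should stay within range, which the helper above does.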
