
In Apache Spark,

repartition(n) - lets you repartition the RDD into exactly n partitions.
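For reference, a minimal sketch of that behaviour (the element placement is illustrative; repartition only fixes the number of partitions, not which elements or how many end up in each one):

    C = sc.parallelize(range(10), 2)
    C.repartition(3).glom().collect()
    # returns 3 partitions, but Spark decides how the 10 elements are spread across them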

But how do you partition a given RDD so that every partition (except possibly the last) has a specified number of elements, given that the number of elements in the RDD is not known and calling .count() is expensive?

    C = sc.parallelize([x for x in range(10)], 2)

Let's say internally, C = [[0,1,2,3,4,5], [6,7,8,9]].

    C = someCode(3)

Expected:

    C = [[0,1,2], [3,4,5], [6,7,8], [9]]

1 Answer


Quite easy in PySpark:

    C = sc.parallelize([x for x in range(10)], 2)
    # key each element by itself so partitionBy can use the value as the key
    rdd = C.map(lambda x: (x, x))
    # custom partition function: maps keys 0..9 into 4 groups of at most 3 elements
    # (int(x * 4 / 11) gives 0,1,2 -> 0; 3,4,5 -> 1; 6,7,8 -> 2; 9 -> 3)
    C_repartitioned = rdd.partitionBy(4, lambda x: int(x * 4 / 11)).map(lambda x: x[0]).glom().collect()
    C_repartitioned

    [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9]]

It is called custom partitioning. More on that: http://sparkdatasourceapi.blogspot.ru/2016/10/patitioning-in-spark-writing-custom.html

http://baahu.in/spark-custom-partitioner-java-example/
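If the elements are not conveniently the numbers 0..n-1, the same idea can be applied to positional indices instead of values. The sketch below is my own generalization (the helper name chunk_into_fixed_size is made up, not from the linked posts): it keys each element by its position via zipWithIndex (which itself triggers a Spark job to compute per-partition offsets, so it is not completely free) and then sends index // chunk_size to the partitioner. It still needs the total element count, or at least an upper bound, to size the number of partitions.

    def chunk_into_fixed_size(rdd, chunk_size, total_count):
        # ceil(total_count / chunk_size) partitions, computed with integer math
        num_parts = (total_count + chunk_size - 1) // chunk_size
        return (rdd.zipWithIndex()                          # (element, position)
                   .map(lambda kv: (kv[1], kv[0]))          # key by position
                   .partitionBy(num_parts, lambda idx: idx // chunk_size)
                   .map(lambda kv: kv[1]))                  # drop the index again

    C = sc.parallelize(range(10), 2)
    chunk_into_fixed_size(C, 3, 10).glom().collect()
    # [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9]]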