I have an RDD with `n` partitions and I would like to split it into `k` RDDs in such a way that

    rdd = rdd_1.union(rdd_2).union(rdd_3)...union(rdd_k)
So, for example, if `n=10` and `k=2`, I would like to end up with two RDDs, where `rdd_1` is composed of the first 5 partitions and `rdd_2` of the other 5.
What is the most efficient way to do this in Spark?
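One way to do this is with PySpark's `mapPartitionsWithIndex`, keeping only the elements whose partition index falls in the wanted range. The sketch below factors that filter into a helper (`keep_partitions` is an illustrative name, not a Spark API) and then demonstrates the same logic on plain Python lists standing in for partitions, so it runs without a Spark cluster:

```python
def keep_partitions(index, iterator, wanted):
    """Yield the partition's elements only if its index is in `wanted`."""
    return iterator if index in wanted else iter([])

# With a real RDD of 10 partitions and k=2, this would look like:
#   rdd_1 = rdd.mapPartitionsWithIndex(
#       lambda i, it: keep_partitions(i, it, set(range(0, 5))))
#   rdd_2 = rdd.mapPartitionsWithIndex(
#       lambda i, it: keep_partitions(i, it, set(range(5, 10))))

# Plain-Python demonstration on fake "partitions" (n=4, k=2):
partitions = [[0, 1], [2, 3], [4, 5], [6, 7]]
first = [x for i, p in enumerate(partitions)
         for x in keep_partitions(i, iter(p), {0, 1})]
second = [x for i, p in enumerate(partitions)
          for x in keep_partitions(i, iter(p), {2, 3})]
print(first, second)  # → [0, 1, 2, 3] [4, 5, 6, 7]
```

Note that each resulting RDD still has `n` partitions (the unwanted ones are simply empty), so a `coalesce` afterwards may be worthwhile if the empty partitions matter.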
You could look at `repartition`. I cannot see how having an RDD per partition would serve any purpose unless you also had your own partitioner. Also note that there are many functions that can make use of the partition index, so you can just `return` for invalid partitions. Last but not least, using `groupBy` could also be applicable if your partitions have a logical split. – Ioannis Deligiannis
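The comment's last point, splitting on a logical key rather than a physical partition index, can be sketched in plain Python (the `records` data and the key `"a"` are illustrative; in Spark this would be `rdd.filter` on the key, one filter per output RDD):

```python
# Illustrative records carrying a logical key; in Spark each tuple
# would be an element of the RDD.
records = [("a", 1), ("b", 2), ("a", 3), ("b", 4)]

# Split by key predicate, as rdd.filter(lambda kv: kv[0] == "a")
# and rdd.filter(lambda kv: kv[0] == "b") would in Spark.
split_a = [kv for kv in records if kv[0] == "a"]
split_b = [kv for kv in records if kv[0] == "b"]
print(split_a)  # → [('a', 1), ('a', 3)]
```

This avoids depending on how Spark happened to lay out the physical partitions, at the cost of scanning the data once per output RDD.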