Problem Definition
Divide a Spark RDD into N sub-parts (N defined by the user) and perform the same function on every sub-part, with a different data set on each sub-part.
Explanation
I have an RDD that I want to divide into sub-parts; let's call them partitions. On each partition I want to perform the same task for multiple iterations. Here is code that illustrates this:
import scala.util.Random

case class Shape(dim: Int) {
  val random = new Random()
  var X: Array[Double] = Array.fill(dim)(random.nextDouble() * (100 - 10) + 11)
  var Y: Array[Double] = Array.fill(dim)(math.random)
  var Loss: Double = math.random
  var NewLoss: Double = math.random
}
val N = 1000          // in a real run N will be in the millions
val d = 100           // dimension of each Shape
val nP = 4            // number of sub-parts (partitions) into which the RDD is divided
val iterations = 1000 // in a real analysis iterations will be in the millions or billions
val list = List.fill(N)(Shape(d))
list.foreach { x =>
  x.Loss = SphereFunc(x.X) // update the Loss of each element
}
val rdd = sc.parallelize(list, nP)
var partitioned = rdd.persist()
for (iter <- 1 to iterations) {
  partitioned = partitioned.mapPartitionsWithIndex { (k, iterator) =>
    val li = iterator.toList
    val localBest = li.minBy(_.Loss).X
    // move each element towards the partition-local best
    li.foreach { j =>
      j.Y = ((j.X, localBest).zipped.map(_ - _).map(_ * math.random), j.Y).zipped.map(_ + _)
      j.X = (j.X, j.Y).zipped.map(_ + _)
    }
    // randomly re-seed elements with low loss around the partition-local best
    li.filter(math.random > _.Loss).foreach { j =>
      j.X = localBest.map(_ + math.random)
    }
    li.foreach { j => j.NewLoss = SphereFunc(j.X) }
    // accept the new loss when it improves on the old one (with a random acceptance gate)
    li.filter(j => math.random < j.NewLoss && j.NewLoss < j.Loss).foreach { j =>
      j.Loss = j.NewLoss
    }
    li.iterator
  }
}
def SphereFunc(xs: Array[Double]): Double =
  xs.foldLeft(0.0)((acc, x) => acc + x * x)
In this code I create an RDD, and on every iteration I call mapPartitionsWithIndex to get each partition's data and update that partition's elements.
The problem with this code is that each call to mapPartitionsWithIndex performs the required operation on every element of the partitions and then creates a new RDD, so the partitions do not stay the same: the RDD is divided into partitions again on every iteration, and I worry this also causes a lot of shuffling.
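The only mitigation I can think of inside this per-iteration structure is sketched below; this is an assumption on my part, not code I have verified at scale. The idea is to pass preservesPartitioning = true so any partitioner metadata is kept, and to checkpoint periodically so the lineage does not grow without bound. The checkpoint directory, the interval of 100, and the name step are hypothetical.

// step wraps the per-partition update from the loop above
val step = (k: Int, it: Iterator[Shape]) => {
  val li = it.toList
  // ... same per-partition update as in the loop above ...
  li.iterator
}
sc.setCheckpointDir("hdfs:///tmp/checkpoints") // hypothetical directory
for (iter <- 1 to iterations) {
  val next = partitioned
    .mapPartitionsWithIndex(step, preservesPartitioning = true)
    .persist()
  if (iter % 100 == 0) next.checkpoint() // truncate the lineage periodically
  next.count()                           // action that forces this iteration to run
  partitioned.unpersist()                // release the previous iteration's cache
  partitioned = next
}

Even with this, every iteration still builds a new RDD and launches a new set of tasks.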
What I want instead is to create the partitions from the RDD once, at the start, and then perform the desired operation on those partitions until the user-defined number of iterations completes. Here I am calling mapPartitionsWithIndex iterations times. Is it possible to call mapPartitionsWithIndex only once and then iterate over the partitions until iterations is reached? See the sketch below for what I have in mind.
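A minimal sketch of the pattern I am after, assuming the whole loop can run inside a single pass over each partition (result is my own name; the body abridges the per-partition update from above):

// Sketch: one mapPartitionsWithIndex call; all iterations run locally per partition.
val result = rdd.mapPartitionsWithIndex { (k, iterator) =>
  val li = iterator.toList
  for (iter <- 1 to iterations) {
    val localBest = li.minBy(_.Loss).X
    li.foreach { j =>
      j.Y = ((j.X, localBest).zipped.map(_ - _).map(_ * math.random), j.Y).zipped.map(_ + _)
      j.X = (j.X, j.Y).zipped.map(_ + _)
    }
    li.foreach(j => j.NewLoss = SphereFunc(j.X))
    li.filter(j => math.random < j.NewLoss && j.NewLoss < j.Loss)
      .foreach(j => j.Loss = j.NewLoss)
  }
  li.iterator
}
result.count() // one action; Spark visits each partition exactly once

With this shape, each partition's data would stay local to its task for the entire run, though I assume it gives up fault tolerance and any cross-partition exchange between iterations.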
How can I achieve that? I have to run the experiments on a cluster with more than 100 cores.