11
votes

How can I take an RDD in Spark and split it randomly into two RDDs, so that each holds a portion of the data (let's say 97% and 3%)?

I thought of shuffling the list and then taking shuffledList.take((0.97 * rddList.count).toInt)

But how can I shuffle the RDD?

Or is there a better way to split the list?

2
Are all the items unique (i.e. no duplicates)? Just wondering if you could use takeSample() and then filter the sample out of the original list. - DNA
They can contain duplicates, but why does that matter? What would you do differently if they were unique? - griffon vulture
OK, I don't think the takeSample approach would work with duplicates. - DNA
It's also problematic because I want to keep the second part (i.e. the 3%) as well. - griffon vulture

2 Answers

22
votes

I've found a simple and fast way to split the array:

val Array(f1, f2) = data.randomSplit(Array(0.97, 0.03))

It will split the data using the provided weights.
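
For context, here is a minimal, self-contained sketch of how this might look end to end (the RDD contents and the seed value are just illustrative). Note that randomSplit samples each element independently, so the resulting sizes are only approximately 97%/3%:

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("split-example").setMaster("local[*]"))

// Build a sample RDD (illustrative data).
val data = sc.parallelize(1 to 10000)

// Split into ~97% / ~3%; passing a seed makes the split reproducible.
val Array(f1, f2) = data.randomSplit(Array(0.97, 0.03), seed = 42L)

// Each element is assigned independently, so the counts are approximate.
println(s"f1: ${f1.count()}, f2: ${f2.count()}")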

6
votes

You should use the randomSplit method:

def randomSplit(weights: Array[Double], seed: Long = Utils.random.nextLong): Array[RDD[T]]

// Randomly splits this RDD with the provided weights.
// weights for splits, will be normalized if they don't sum to 1
// returns split RDDs in an array
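
Since the weights are normalized, they don't have to sum to 1; for example, the following two calls behave the same (a sketch, assuming an existing RDD named data):

// Array(97.0, 3.0) is normalized to Array(0.97, 0.03) internally.
val splitsA = data.randomSplit(Array(0.97, 0.03))
val splitsB = data.randomSplit(Array(97.0, 3.0))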

Here is its implementation in Spark 1.0:

def randomSplit(weights: Array[Double], seed: Long = Utils.random.nextLong): Array[RDD[T]] = {
  val sum = weights.sum
  val normalizedCumWeights = weights.map(_ / sum).scanLeft(0.0d)(_ + _)
  normalizedCumWeights.sliding(2).map { x =>
    new PartitionwiseSampledRDD[T, T](this, new BernoulliSampler[T](x(0), x(1)), seed)
  }.toArray
}
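
In rough terms: scanLeft turns the normalized weights into cumulative bounds, sliding(2) pairs those bounds into intervals, and each split keeps the elements whose random draw falls into its interval. A small plain-Scala illustration of the interval computation (no Spark required):

val weights = Array(0.97, 0.03)
val cum = weights.map(_ / weights.sum).scanLeft(0.0)(_ + _)
// cum == Array(0.0, 0.97, 1.0)
val intervals = cum.sliding(2).toList
// intervals == List(Array(0.0, 0.97), Array(0.97, 1.0))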