11 votes

How can I take an RDD in Spark and split it into two RDDs randomly, so that each RDD includes some part of the data (let's say 97% and 3%)?

I thought of shuffling the list and then taking shuffledList.take((0.97 * rddList.count).toInt)

But how can I shuffle the RDD?

Or is there a better way to split the list?

Are all the items unique (i.e. no duplicates)? Just wondering if you can use takeSample() and then filter the sample out of the original list. – DNA
They can be duplicates, but why does that matter? What would you be able to do if they were unique? – griffon vulture
OK, I don't think the takeSample approach would work with duplicates. – DNA
It is also problematic because I also want to save the second part (i.e. the 3%). – griffon vulture

2 Answers

22 votes

I've found a simple and fast way to split the array:

val Array(f1, f2) = data.randomSplit(Array(0.97, 0.03))

It will split the data according to the provided weights. Note that the split is random per element, so the resulting sizes are only approximately 97% and 3%.
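For example, a minimal sketch of saving both parts (sc is an assumed existing SparkContext; the variable names and output paths are hypothetical):

// Split an RDD of integers 97%/3% and save both parts.
val data = sc.parallelize(1 to 100000)

// Passing an explicit seed makes the split reproducible across runs.
val Array(big, small) = data.randomSplit(Array(0.97, 0.03), seed = 42L)

big.saveAsTextFile("/tmp/big-part")     // ~97% of the records
small.saveAsTextFile("/tmp/small-part") // ~3% of the records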

6 votes

You should use the randomSplit method:

def randomSplit(weights: Array[Double], seed: Long = Utils.random.nextLong): Array[RDD[T]]

// Randomly splits this RDD with the provided weights.
// weights: weights for the splits; will be normalized if they don't sum to 1
// returns: the split RDDs in an array

Here is its implementation in Spark 1.0:

def randomSplit(weights: Array[Double], seed: Long = Utils.random.nextLong): Array[RDD[T]] = {
  val sum = weights.sum
  // Normalize the weights and turn them into cumulative boundaries,
  // e.g. Array(0.97, 0.03) becomes Array(0.0, 0.97, 1.0).
  val normalizedCumWeights = weights.map(_ / sum).scanLeft(0.0d)(_ + _)
  // Each adjacent pair of boundaries [lb, ub) defines one split; all splits
  // share the same seed, so together they cover the RDD without overlap.
  normalizedCumWeights.sliding(2).map { x =>
    new PartitionwiseSampledRDD[T, T](this, new BernoulliSampler[T](x(0), x(1)), seed)
  }.toArray
}
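To see what the normalization step produces, here is a small standalone sketch of just that logic (plain Scala, no Spark required; the object name is made up):

object CumWeightsDemo extends App {
  val weights = Array(0.97, 0.03)
  val sum = weights.sum
  // Cumulative boundaries: Array(0.0, 0.97, 1.0)
  val normalizedCumWeights = weights.map(_ / sum).scanLeft(0.0d)(_ + _)
  // sliding(2) yields the per-split ranges: [0.0, 0.97) and [0.97, 1.0)
  normalizedCumWeights.sliding(2).foreach(x => println(s"[${x(0)}, ${x(1)})"))
}

Because every BernoulliSampler is driven by the same seed, each element's random draw falls into exactly one of these ranges, which is why the resulting RDDs are disjoint and together cover the whole input.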