You should use the randomSplit method:
def randomSplit(weights: Array[Double], seed: Long = Utils.random.nextLong): Array[RDD[T]]
// Randomly splits this RDD with the provided weights.
// weights: weights for the splits; will be normalized if they don't sum to 1
// returns: the split RDDs in an array
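For example, a common 60/40 train/test split looks like this (the weights, seed, and sample data here are illustrative):

val sc = new org.apache.spark.SparkContext(
  new org.apache.spark.SparkConf().setAppName("randomSplit-example").setMaster("local[*]"))
val data = sc.parallelize(1 to 10000)

// The weights don't have to sum to 1; they are normalized internally
val Array(train, test) = data.randomSplit(Array(0.6, 0.4), seed = 11L)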
Here is its implementation in Spark 1.0:
def randomSplit(weights: Array[Double], seed: Long = Utils.random.nextLong): Array[RDD[T]] = {
  val sum = weights.sum
  // Cumulative normalized weights, e.g. Array(0.6, 0.4) -> Array(0.0, 0.6, 1.0)
  val normalizedCumWeights = weights.map(_ / sum).scanLeft(0.0d)(_ + _)
  // Each adjacent pair [lb, ub) defines one split; the Bernoulli samplers
  // partition the unit interval, so the splits are disjoint and cover the RDD
  normalizedCumWeights.sliding(2).map { x =>
    new PartitionwiseSampledRDD[T, T](this, new BernoulliSampler[T](x(0), x(1)), seed)
  }.toArray
}
Alternatively, you could use takeSample() and then filter the sample out of the original RDD. - DNA
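A minimal sketch of that alternative, assuming the RDD's elements are distinct (filtering by value removes every occurrence of a sampled element, so duplicates would be lost); the sample size and seed are placeholders:

val sc = new org.apache.spark.SparkContext(
  new org.apache.spark.SparkConf().setAppName("takeSample-example").setMaster("local[*]"))
val rdd = sc.parallelize(1 to 10000)

// takeSample collects a fixed-size sample without replacement to the driver
val sample = rdd.takeSample(withReplacement = false, num = 1000, seed = 42L)

// Keep everything that was not sampled; assumes distinct elements
val sampleSet = sample.toSet
val rest = rdd.filter(x => !sampleSet.contains(x))

Note that unlike randomSplit, this gives you an exact sample size, but it materializes the sample on the driver, so it only suits samples small enough to fit in driver memory.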