11
votes

How can I take an RDD in Spark and split it randomly into two RDDs, so that each holds a portion of the data (let's say 97% and 3%)?

I thought of shuffling the list and then taking shuffledList.take((0.97 * rddList.count).toInt)

But how can I shuffle the RDD?

Or is there a better way to split the list?

2
Are all the items unique (i.e. no duplicates)? Just wondering if you could use takeSample() and then filter the sample out of the original list. - DNA
They can contain duplicates, but why does that matter? What would you do differently if they were unique? - griffon vulture
OK, I don't think the takeSample approach would work with duplicates. - DNA
It's also problematic because I want to keep the second part (i.e. the 3%) as well. - griffon vulture

2 Answers

22
votes

I've found a simple and fast way to split the array:

val Array(f1, f2) = data.randomSplit(Array(0.97, 0.03))

It will split the data using the provided weights.
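
For context, here is a minimal, self-contained sketch of how this might look end to end (the RDD contents and the seed value are just illustrative). Note that randomSplit samples each element independently, so the resulting sizes are only approximately 97%/3%:

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("split-example").setMaster("local[*]"))

// Build a sample RDD (illustrative data).
val data = sc.parallelize(1 to 10000)

// Split into ~97% / ~3%; passing a seed makes the split reproducible.
val Array(f1, f2) = data.randomSplit(Array(0.97, 0.03), seed = 42L)

// Each element is assigned independently, so the counts are approximate.
println(s"f1: ${f1.count()}, f2: ${f2.count()}")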

6
votes

You should use the randomSplit method:

def randomSplit(weights: Array[Double], seed: Long = Utils.random.nextLong): Array[RDD[T]]

// Randomly splits this RDD with the provided weights.
// weights for splits, will be normalized if they don't sum to 1
// returns split RDDs in an array
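
Since the weights are normalized, they don't have to sum to 1; for example, the following two calls behave the same (a sketch, assuming an existing RDD named data):

// Array(97.0, 3.0) is normalized to Array(0.97, 0.03) internally.
val splitsA = data.randomSplit(Array(0.97, 0.03))
val splitsB = data.randomSplit(Array(97.0, 3.0))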

Here is its implementation in Spark 1.0:

def randomSplit(weights: Array[Double], seed: Long = Utils.random.nextLong): Array[RDD[T]] = {
  val sum = weights.sum
  val normalizedCumWeights = weights.map(_ / sum).scanLeft(0.0d)(_ + _)
  normalizedCumWeights.sliding(2).map { x =>
    new PartitionwiseSampledRDD[T, T](this, new BernoulliSampler[T](x(0), x(1)), seed)
  }.toArray
}
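
In rough terms: scanLeft turns the normalized weights into cumulative bounds, sliding(2) pairs those bounds into intervals, and each split keeps the elements whose random draw falls into its interval. A small plain-Scala illustration of the interval computation (no Spark required):

val weights = Array(0.97, 0.03)
val cum = weights.map(_ / weights.sum).scanLeft(0.0)(_ + _)
// cum == Array(0.0, 0.97, 1.0)
val intervals = cum.sliding(2).toList
// intervals == List(Array(0.0, 0.97), Array(0.97, 1.0))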