Suppose I have an RDD with 3000 rows. The first 2000 rows are of class 1 and the last 1000 rows are of class 2. The RDD is spread across 100 partitions.
When I call RDD.randomSplit with weights 0.8 and 0.2:
Does the function also shuffle the RDD? Or does the split simply take a contiguous 20% of the rows? Or does it select 20% of the partitions at random?
Ideally, each resulting split would have the same class distribution as the original RDD (i.e. 2:1). Is that what happens?
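
To make the three possibilities concrete, here is a small pure-Python sketch of what each one would do to the class counts in the 20% split. This is not Spark code and makes no claim about what randomSplit actually does. The function names, the fixed seed, and the layout of 100 contiguous partitions of 30 rows each are my own assumptions for illustration.

```python
import random

# Toy stand-in for the RDD: 2000 rows of class 1 followed by 1000 rows of
# class 2, cut into 100 contiguous "partitions" of 30 rows (assumed layout).
rows = [1] * 2000 + [2] * 1000
partitions = [rows[i:i + 30] for i in range(0, len(rows), 30)]


def per_row_split(parts, frac, seed=0):
    """Possibility 1: each row is picked independently with probability frac."""
    rng = random.Random(seed)
    return [r for p in parts for r in p if rng.random() < frac]


def contiguous_split(parts, frac):
    """Possibility 2: a contiguous block, here the last frac of the rows."""
    flat = [r for p in parts for r in p]
    k = round(len(flat) * frac)
    return flat[len(flat) - k:]


def partition_split(parts, frac, seed=0):
    """Possibility 3: a random frac of whole partitions."""
    rng = random.Random(seed)
    chosen = rng.sample(range(len(parts)), round(len(parts) * frac))
    return [r for i in chosen for r in parts[i]]


def class_counts(sample):
    """Return (number of class 1 rows, number of class 2 rows)."""
    return sample.count(1), sample.count(2)


print("per-row:   ", class_counts(per_row_split(partitions, 0.2)))
print("contiguous:", class_counts(contiguous_split(partitions, 0.2)))
print("partitions:", class_counts(partition_split(partitions, 0.2)))
```

In this toy version the contiguous block contains only class 2 rows, while the other two variants give a mix whose ratio varies with the seed. What I want to know is which of these (if any) matches randomSplit.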
Thanks