My question about RDDs is: what happens when we try to create more partitions than there are elements in the data?
You can easily see how many partitions a given RDD has by using data.getNumPartitions. I tried creating the RDD you mentioned and running this command, and it shows there are 8 partitions: 4 partitions had one number each, and the remaining 4 were empty.
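For reference, here is a minimal way to reproduce this; the exact contents of the RDD are an assumption on my part, since only "four numbers" is implied above:

val data = sc.parallelize(Seq(1, 2, 3, 4), 8) // 4 elements, 8 requested partitions (data is assumed)
println(data.getNumPartitions)                // prints 8 -- Spark keeps the empty partitions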
If it's creating 8 partitions, is the data then replicated across the partitions?
You can try the following code and check the executor output to see how many records are in each partition. Note the print statement in the code below: the API requires the function to return an iterator, so I return each element multiplied by 2. One pitfall is that an Iterator can be traversed only once, so it has to be materialized before counting, otherwise the map afterwards would see an exhausted iterator.
data.mapPartitionsWithIndex((x, y) => {
  val rows = y.toList // materialize first: the Iterator can be traversed only once
  println(s"partitions $x has ${rows.length} records")
  rows.map(_ * 2).iterator // return the doubled elements, as the API requires an Iterator
}).collect.foreach(println)
I got the following output for the above code:
partitions 0 has 0 records
partitions 1 has 1 records
partitions 2 has 0 records
partitions 3 has 1 records
partitions 4 has 0 records
partitions 5 has 1 records
partitions 6 has 0 records
partitions 7 has 1 records
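As a side note, a simpler way to count records per partition, without having to materialize the iterator yourself, is glom; this is only a sketch, and since glom turns each partition into an in-memory array, it should only be used on small data:

// each partition becomes an Array, so its size is the record count
data.glom().map(_.length).collect().zipWithIndex.foreach {
  case (count, idx) => println(s"partitions $idx has $count records")
}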
What I am curious about is: does Spark still create 8 partitions here, or does it optimize that down to the number of cores?
Spark still keeps all 8 partitions; it does not collapse empty ones down to the number of cores. The number of partitions defines how much data you want Spark to process in one task. If there are 8 partitions and 4 virtual cores, Spark starts running 4 tasks (corresponding to 4 partitions) at once. As those tasks finish, it schedules the remaining tasks on those cores.
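To make the partitions-vs-cores relationship concrete, here is a small sketch; the local[4] master and the 1 to 4 data are assumptions for illustration:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[4]") // 4 virtual cores: at most 4 tasks run concurrently (assumed setup)
  .appName("partitions-demo")
  .getOrCreate()
val rdd = spark.sparkContext.parallelize(1 to 4, 8) // 8 partitions => 8 tasks per stage
rdd.count() // one stage of 8 tasks, executed in roughly two waves of 4

The stages tab of the Spark UI will show all 8 tasks for the stage, confirming that the empty partitions still exist as (very cheap) tasks.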