I have a question about default partitioning of RDDs.
case class Animal(id: Int, name: String)

val myRDD = session.sparkContext.parallelize(Array(
  Animal(1, "Lion"), Animal(2, "Elephant"), Animal(3, "Jaguar"),
  Animal(4, "Tiger"), Animal(5, "Chetah")))

println(myRDD.getNumPartitions)
I am running the above piece of code on my laptop, which has 12 logical cores, so I see that 12 partitions are created.
My understanding is that hash partitioning is used to determine which object goes to which partition, so in this case the formula would be hashCode() % 12. But when I examine further, I see that all the objects are put in the last partition.
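For reference, here is a small standalone sketch of what my assumed formula would predict for each object (this just applies hashCode() % 12 directly; I am not sure Spark actually partitions parallelize this way):

```scala
// Standalone sketch (no Spark needed): the partition each object would get
// under my assumed formula hashCode() % numPartitions, with 12 partitions.
// Case classes get a structural hashCode derived from their fields.
case class Animal(id: Int, name: String)

object HashCheck extends App {
  val animals = Seq(
    Animal(1, "Lion"), Animal(2, "Elephant"), Animal(3, "Jaguar"),
    Animal(4, "Tiger"), Animal(5, "Chetah"))

  animals.foreach { a =>
    // Take a non-negative modulus, as a hash partitioner would need an
    // index in the range [0, 12).
    val p = ((a.hashCode % 12) + 12) % 12
    println(s"$a -> partition $p")
  }
}
```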
myRDD.foreachPartition(it => { println("----------"); it.foreach(println) })
The above code prints the following (the first eleven partitions are empty and the last one has all the objects; each dashed line marks the start of one partition's contents):
----------
----------
----------
----------
----------
----------
----------
----------
----------
----------
----------
----------
Animal(2,Elephant)
Animal(4,Tiger)
Animal(3,Jaguar)
Animal(5,Chetah)
Animal(1,Lion)
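To double-check, I believe the partition of each element can also be tagged explicitly with mapPartitionsWithIndex and then collected to the driver, so the output cannot be interleaved by parallel tasks (a sketch, reusing myRDD from above):

```scala
// Tag every element with the index of the partition it lives in, then
// collect to the driver and print in partition order.
val withIndex = myRDD
  .mapPartitionsWithIndex { (idx, it) => it.map(a => (idx, a)) }
  .collect()

withIndex.sortBy(_._1).foreach { case (idx, a) =>
  println(s"partition $idx: $a")
}
```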
I don't know why this happens. Can you please help?

Thanks!