I'm a beginner with Apache Spark. Spark's RDD API offers transformation functions such as map and mapPartitions. I understand that map works on each element of the RDD, whereas mapPartitions works on each partition. Many people have mentioned that mapPartitions is ideally used where we want to do object creation/instantiation, and have provided examples like:
val rddData = sc.textFile("sample.txt")
val res = rddData.mapPartitions(iterator => {
  // Do object instantiation here
  // Use that instantiated object in applying the business logic
})
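To make the pattern above concrete, here is a minimal sketch of it as I understand it. The LineParser class, the sample strings, and the local[2] master are hypothetical stand-ins of mine (they are not from the examples I have seen); the point is only that the object is created once per partition on the executor, and that the function passed to mapPartitions has to return an Iterator.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object MapPartitionsSketch {
  // Hypothetical helper standing in for an expensive or non-serializable object
  class LineParser {
    def parse(line: String): Int = line.trim.length
  }

  // Called once per partition: one LineParser instance serves every element in it
  def processPartition(iterator: Iterator[String]): Iterator[Int] = {
    val parser = new LineParser // created on the executor, so it is never serialized
    iterator.map(line => parser.parse(line)) // lazy; elements are parsed as they are consumed
  }

  def main(args: Array[String]): Unit = {
    // Assumption: local mode with two threads, just for trying this out
    val conf = new SparkConf().setAppName("mapPartitions-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)

    // Hypothetical in-memory data instead of sample.txt
    val rddData = sc.parallelize(Seq("a", "bb", "ccc"), numSlices = 2)
    val res = rddData.mapPartitions(processPartition)

    res.collect().foreach(println)
    sc.stop()
  }
}
```

With two partitions, the sketch builds two LineParser instances in total, rather than one per element.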
My question is: can we not do that with the map function itself, by instantiating the object outside the map function, like this:
val rddData = sc.textFile("sample.txt")
val obj = InstantiatingSomeObject
val res = rddData.map(element =>
  // Use the instantiated object 'obj' and do something with data
)
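A minimal sketch of what I mean by the map variant above, assuming a hypothetical Tokenizer class and made-up sample data. As far as I can tell, the object is created on the driver and captured by the closure, so Spark has to serialize it and ship it to the executors along with the tasks. If the class were not Serializable, I would expect the job to fail with a "Task not serializable" exception.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical helper; it must be Serializable because the map closure captures it
class Tokenizer(val separator: String) extends Serializable {
  def tokenCount(line: String): Int = line.split(separator).length
}

object MapClosureSketch {
  def main(args: Array[String]): Unit = {
    // Assumption: local mode with two threads, just for trying this out
    val conf = new SparkConf().setAppName("map-closure-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)

    // Hypothetical in-memory data instead of sample.txt
    val rddData = sc.parallelize(Seq("a b c", "d e"))

    // Created once, on the driver
    val obj = new Tokenizer(" ")

    // obj is serialized together with the closure; executors work on their own copies
    val res = rddData.map(element => obj.tokenCount(element))

    res.collect().foreach(println)
    sc.stop()
  }
}
```

The difference from the mapPartitions sketch is where the object is born: here it has to survive serialization, whereas an object created inside mapPartitions never leaves the executor.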
I could be wrong in my fundamental understanding of map and mapPartitions; if the question itself is wrong, please correct me.
Serializable though. And note that you should only read data from the object, as your driver code won't see any changes made to the object. - Raphael Roth