Differences: Object instantiation within mapPartitions vs outside

Question

I'm a beginner to Apache Spark. Spark's RDD API offers transformation functions like map, mapPartitions. I can understand that map works with each element from the RDD but mapPartitions works with each partition and many people have mentioned mapPartitions are ideally used where we want to do object creation/instantiation and provided examples like:

val rddData = sc.textFile("sample.txt")
val res = rddData.mapPartitions(iterator => {
   //Do object instantiation here
   //Use that instantiated object in applying the business logic
   })

My question is can we not do that with map function itself by doing object instantiation outside the map function like:

val rddData = sc.textFile("sample.txt")
val obj = InstantiatingSomeObject

val res = rddData.map(element => 
    //Use the instantiated object 'obj' and do something with data
)

I could be wrong in my fundamental understanding of map and mapPartitions and if the question is wrong, please correct me.

all the transformations done inside map and mapPartitions are done in executors. if you create object outside map or mapPartition then it is in the driver node which is not visible by the executors. So you will have to pass the object with the map or mapPartitions if you declare it outside the map. — Ramesh Maharjan
you can instantiate the object outside, it will be serialized and also sent to the executors. This only works if your object implements Serializablethough. And note that you should only read data from the object, as your driver code won't see any changes made to object — Raphael Roth

werner werner · Accepted Answer · 2018-02-25T16:40:58

All objects that you create outside of your lambdas are created on the driver. For each execution of the lambda, they are sent over the network to the specific executor.

When calling map, the lambda is executed once per data element, causing to send your serialized object once per data element over the network. When using mapPartitions, this happens only once per partition. However, even when using mapPartions, it would usually be better to create the object inside of your lambda. In many cases, your object is not serializable (like a database connection for example). In this case you have to create the object inside your lambda.

Differences: Object instantiation within mapPartitions vs outside

1 Answers