Spark RDD- map vs mapPartitions

Question

I read through theoretical differences between map and mapPartitions, & 'm much clear when to use them in varied situations.

But my problem described below is more based upon GC activity & Memory (RAM). Please read below for the problem:-

=> I wrote a map function to convert Row to String. So, an input of RDD[org.apache.spark.sql.Row] would be mapped to RDD[String]. But with this approach map object would be created for every row of an RDD. Thus creation of such large number of objects may increase GC activity.

=> To resolve above, I thought of using mapPartitions. So, that number of objects become equivalent to number of partitions. mapPartitions gives Iterator as an input and accepts to return and java.lang.Iterable. But most of the Iterable like Array, List, etc are in memory. So, if I have huge amount of data then would creating a Iterable this way can lead to out of Memory ? or Is there any other collection (java or scala) that should be utilized here (to spill to Disk in case memory starts to fill)? or should we only use mapPartitions in case RDD is completely in Memory?

Thanks in advance. Any help would be greatly appreciated.

user7236328 user7236328 · Accepted Answer · 2016-12-01T13:16:36

If you think about JavaRDD.mapPartitions it takes FlatMapFunction (or some variant like DoubleFlatMapFunction) which is expected to return Iterator not Iterable. If underlaying collection is lazy then you have nothing to worry about.

RDD.mapPartitions takes a functions from Iterator to Iterator.

I general if you use reference data you can replace mapPartitions with map and use static member to store data. This will have the same footprint and will be easier to write.

Spark RDD- map vs mapPartitions

2 Answers