Examples borrowed from Internet, thanks to those with better insights.
The following can be found on various forums in relation to mapPartitions and map:
... Consider the case of Initializing a database. If we are using map() or
foreach(), the number of times we would need to initialize will be equal to
the no of elements in RDD. Whereas if we use mapPartitions(), the no of times
we would need to initialize would be equal to number of Partitions ...
Then there is this response:
val newRd = myRdd.mapPartitions(
partition => {
val connection = new DbConnection /*creates a db connection per partition*/
val newPartition = partition.map(
record => {
readMatchingFromDB(record, connection)
})
connection.close()
newPartition
})
So, my questions are after having read discussions on various items pertaining to this:
- Whilst I can understand the performance improvement using mapPartitions in general, why would according to the first snippet of text, the database connection be called every time for each element of an RDD using map? I can't seem to find the right reason.
- The same things does not happen with sc.textFile ... and reading into dataframes from jdbc connections. Or does it? I would be very surprised if this was so.
What am I missing...?