The first difference between the two is that forEach() is an action, while mapPartition() is a transformation. It would be more meaningful to compare forEach with forEachPartition, since both are actions: each runs a function on the executors for its side effects and returns nothing to the driver. Refer here for a detailed discussion of those two. As for memory consumption, it really depends on how much data you return to the driver. As a rule of thumb, bring results back to the driver with methods like limit(), take(), first(), etc., and avoid collect() unless you are sure that the data fits in the driver's memory.
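To make the transformation/action distinction and the take() versus collect() advice concrete, here is a plain-Python analogy that needs no Spark installation. The names lazy_map, take and collect are hypothetical stand-ins written for this sketch, not Spark calls, and real Spark evaluates whole partitions rather than single items, so the exact counts would differ on a cluster.

```python
from itertools import islice

calls = 0  # how many items the "transformation" has actually processed


def double(x):
    """Per-item function; counts its own invocations."""
    global calls
    calls += 1
    return x * 2


def lazy_map(func, items):
    """Stand-in for a transformation: builds a lazy pipeline, does no work yet."""
    for item in items:
        yield func(item)


def take(n, dataset):
    """Stand-in for take(n): brings at most n results back to the 'driver'."""
    return list(islice(dataset, n))


def collect(dataset):
    """Stand-in for collect(): brings every result back to the 'driver'."""
    return list(dataset)


data = range(1_000_000)

transformed = lazy_map(double, data)  # transformation only: nothing runs yet
calls_before_action = calls

first_three = take(3, transformed)    # action: only as much work as needed
calls_after_take = calls

everything = collect(lazy_map(double, range(10)))  # action: materializes all 10
```

In this model the million-element dataset costs almost nothing until an action asks for results, and take(3) holds three values in memory where collect() would hold all of them.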
mapPartition can be compared with the map or flatMap functions: all three are transformations that modify the dataset's data. mapPartition can be more efficient because it calls the given function once per partition, whereas map calls it once for each item in the dataset, so any expensive setup inside the function is paid far fewer times. Refer here for more details about these two functions.
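The difference in call counts can be sketched in plain Python, with partitions modeled as ordinary lists. This is only an illustration, not Spark code: simulate_map and simulate_map_partitions are hypothetical helpers, and the partition layout is made up for the example (in the Spark RDD API the corresponding methods are spelled map and mapPartitions).

```python
call_counts = {"per_item": 0, "per_partition": 0}


def simulate_map(func, partitions):
    """Call func once per item, the way map does."""
    return [[func(item) for item in part] for part in partitions]


def simulate_map_partitions(func, partitions):
    """Call func once per partition, handing it an iterator over the items."""
    return [list(func(iter(part))) for part in partitions]


def add_one(item):
    """Per-item function: invoked for every element."""
    call_counts["per_item"] += 1
    return item + 1


def add_one_to_partition(items):
    """Per-partition function: invoked once, then loops over the items itself."""
    call_counts["per_partition"] += 1
    # One-off setup (for example, opening a connection) would go here.
    for item in items:
        yield item + 1


# Assumed layout: 9 items spread over 3 partitions.
partitions = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]

mapped = simulate_map(add_one, partitions)
mapped_by_partition = simulate_map_partitions(add_one_to_partition, partitions)
```

Both calls produce the same data, but add_one runs 9 times while add_one_to_partition runs 3 times. The saving only matters when each call carries a fixed cost worth amortizing.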