I am learning Spark using the book 'Learning Spark'. I came across this term (page 54):
We can disable map-side aggregation in combineByKey() if we know that our data won’t benefit from it
I am confused about what is meant by map-side aggregation here. The only thing that comes to my mind is the Mapper and Reducer in Hadoop MapReduce, but I believe that is in no way related to Spark.
3 Answers
The idea behind map-side aggregation is pretty much the same as that of Hadoop combiners: if a single mapper can yield multiple values for the same key, you can reduce shuffling by merging those values locally before the data crosses the network.
One example of an operation that can benefit from map-side aggregation is creating a set of values for each key, especially when you partition an RDD before combining:
First let's create some dummy data and partition it:
val pairs = sc.parallelize(
  ("foo", 1) :: ("foo", 1) :: ("foo", 2) ::
  ("bar", 3) :: ("bar", 4) :: ("bar", 5) :: Nil
)

// Partition the RDD up front; the combine step below operates on partitionedPairs
import org.apache.spark.HashPartitioner
val partitionedPairs = pairs.partitionBy(new HashPartitioner(2))
And merge the data using combineByKey:
import collection.mutable.{Set => MSet}

val combined = partitionedPairs.combineByKey(
  (v: Int) => MSet[Int](v),                           // createCombiner: start a new set for a key
  (set: MSet[Int], v: Int) => set += v,               // mergeValue: fold a value into the local set
  (set1: MSet[Int], set2: MSet[Int]) => set1 ++= set2 // mergeCombiners: merge sets from different partitions
)
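Collecting the result on the dummy data above yields something like the following (element order within each set is not guaranteed):

combined.collect()
// Array((foo, Set(1, 2)), (bar, Set(3, 4, 5)))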
Depending on the data distribution, this can significantly reduce network traffic. Overall, reduceByKey, aggregateByKey, foldByKey, and combineByKey with mapSideCombine set to true will use map-side aggregation, while groupByKey and combineByKey with mapSideCombine set to false won't.
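As a minimal sketch of disabling the map side (reusing the pairs and MSet definitions above, and assuming the combineByKey overload that takes an explicit partitioner):

// Same combine logic as before, but with map-side aggregation turned off,
// so every (key, value) pair is shuffled unmerged
val noMapSide = pairs.combineByKey(
  (v: Int) => MSet[Int](v),
  (set: MSet[Int], v: Int) => set += v,
  (set1: MSet[Int], set2: MSet[Int]) => set1 ++= set2,
  new HashPartitioner(2), // this overload requires a partitioner
  mapSideCombine = false
)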
However, the choice between applying map-side aggregation or not is not always obvious. The cost of maintaining the required data structures, and of the garbage collection they trigger, can in many cases exceed the cost of the shuffle.
You're right, the term map-side reduce does come from the Map/Reduce land, and the idea carries over to Apache Spark with a twist. If we could combine multiple elements within a partition before shuffling them, and the combined elements took up less space, then performing a per-partition reduction prior to shuffling the data would be useful.
One case where map-side reduction is disabled in Spark is with groupByKey: even if we can combine some of the elements in the same partition, they will take up about the same amount of space anyway, so there is no corresponding reduction in network or serialization work.
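As a rough illustration (a hypothetical word-count style sketch, not from the book), compare the two approaches for computing per-key sums:

val nums = sc.parallelize(Seq(("a", 1), ("a", 1), ("b", 1)))

// Map-side aggregation: at most one record per key per partition is shuffled
val withCombine = nums.reduceByKey(_ + _)

// No map-side aggregation: every individual (key, value) pair crosses the network
val withoutCombine = nums.groupByKey().mapValues(_.sum)

Both produce the same result, but reduceByKey pre-merges within each partition while groupByKey ships the raw values.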
Hope that helps and glad you are reading Learning Spark :)