I am learning Spark using the book 'Learning Spark'. I came across this term (page 54):
We can disable map-side aggregation in combineByKey() if we know that our data won’t benefit from it
I am confused about what is meant by map-side aggregation here. The only thing that comes to my mind is the Mapper and Reducer in Hadoop MapReduce, but I believe that is in no way related to Spark.
3 Answers
The idea behind map-side aggregation is pretty much the same as that of Hadoop combiners: if a single mapper can yield multiple values for the same key, you can reduce shuffling by merging those values locally before the data crosses the network.
One example of an operation that can benefit from map-side aggregation is creating a set of values for each key, especially when you partition an RDD before combining:
First let's create some dummy data and partition it:
val pairs = sc.parallelize(
  ("foo", 1) :: ("foo", 1) :: ("foo", 2) ::
  ("bar", 3) :: ("bar", 4) :: ("bar", 5) :: Nil
)

// Partition the RDD up front; the combine step below operates on partitionedPairs
import org.apache.spark.HashPartitioner
val partitionedPairs = pairs.partitionBy(new HashPartitioner(2))
And merge the data using combineByKey:
import collection.mutable.{Set => MSet}

val combined = partitionedPairs.combineByKey(
  (v: Int) => MSet[Int](v),                           // createCombiner: start a new set for a key
  (set: MSet[Int], v: Int) => set += v,               // mergeValue: fold a value into the local set
  (set1: MSet[Int], set2: MSet[Int]) => set1 ++= set2 // mergeCombiners: merge sets from different partitions
)
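Collecting the result on the dummy data above yields something like the following (element order within each set is not guaranteed):

combined.collect()
// Array((foo, Set(1, 2)), (bar, Set(3, 4, 5)))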
Depending on the data distribution, this can significantly reduce network traffic. Overall, reduceByKey, aggregateByKey, foldByKey, and combineByKey with mapSideCombine set to true will use map-side aggregation, while groupByKey and combineByKey with mapSideCombine set to false won't.
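As a minimal sketch of disabling the map side (reusing the pairs and MSet definitions above, and assuming the combineByKey overload that takes an explicit partitioner):

// Same combine logic as before, but with map-side aggregation turned off,
// so every (key, value) pair is shuffled unmerged
val noMapSide = pairs.combineByKey(
  (v: Int) => MSet[Int](v),
  (set: MSet[Int], v: Int) => set += v,
  (set1: MSet[Int], set2: MSet[Int]) => set1 ++= set2,
  new HashPartitioner(2), // this overload requires a partitioner
  mapSideCombine = false
)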
However, the choice between applying map-side aggregation or not is not always obvious. The cost of maintaining the required data structures, and of the garbage collection they trigger, can in many cases exceed the cost of the shuffle.
You're right, the term map-side reduce does come from the Map/Reduce land, and the idea carries over to Apache Spark with a twist. If we could combine multiple elements within a partition before shuffling them, and the combined elements took up less space, then performing a per-partition reduction prior to shuffling the data would be useful.
One case where map-side reduction is disabled in Spark is with groupByKey: even if we can combine some of the elements in the same partition, they will take up about the same amount of space anyway, so there is no corresponding reduction in network or serialization work.
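As a rough illustration (a hypothetical word-count style sketch, not from the book), compare the two approaches for computing per-key sums:

val nums = sc.parallelize(Seq(("a", 1), ("a", 1), ("b", 1)))

// Map-side aggregation: at most one record per key per partition is shuffled
val withCombine = nums.reduceByKey(_ + _)

// No map-side aggregation: every individual (key, value) pair crosses the network
val withoutCombine = nums.groupByKey().mapValues(_.sum)

Both produce the same result, but reduceByKey pre-merges within each partition while groupByKey ships the raw values.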
Hope that helps and glad you are reading Learning Spark :)