Spark: Broadcast usage on local mode

Question

I know that a broadcast variable keeps a read-only copy cached on each machine rather than shipping a copy of it with every task. However, I would like to know whether broadcasting has any significant impact in local mode, since I don't have a cluster of nodes. Or is it fine to skip broadcast in local mode? I'm just trying to understand its usage.

Spark version 2.0, Scala version 2.10, local mode (8-core CPU, 64 GB RAM)

I have something like the following:

case class EmpDim(name: String,age: Int)

empDF
+-----+-------+------+
|EmpId|EmpName|EmpAge|
+-----+-------+------+
|    1|   John|    32|
|    2|  David|    45|
+-----+-------+------+

deptDF
+------+--------+-----+
|DeptID|DeptName|EmpID|
+------+--------+-----+
|     1|   Admin|    1|
|     2|      HR|    2|
|     3| Finance|    4|
+------+--------+-----+

val empRDD = empDF.rdd.map(x => (x.getInt(0), EmpDim(x.getString(1), x.getInt(2))))

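// collectAsMap() returns scala.collection.Map; toMap converts it to the immutable Map that lookup expects
val lookupMap = empRDD.collectAsMap().toMap //Without Broadcast
val broadCastLookupMap: Broadcast[Map[Int,EmpDim]] = sc.broadcast(empRDD.collectAsMap().toMap) //With Broadcast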

def lookup(lookupMap:Map[Int,EmpDim]) = udf[Option[EmpDim],Int]((empID:Int) => lookupMap.lift(empID))

val combinedDF = deptDF.withColumn("lookupEmp",lookup(lookupMap)($"EmpID")) //Without Broadcast
                       .withColumn("broadCastLookupEmp",lookup(broadCastLookupMap.value)($"EmpID")) //With Broadcast
                       .withColumn("EmpName",coalesce($"lookupEmp.name",lit("Unknown - No Name to Lookup")))
                       .withColumn("EmpAge",coalesce($"lookupEmp.age",lit("Unknown - No Age to Lookup")))
                       .drop("lookupEmp")
                       .drop("broadCastLookupEmp")

+------+--------+-----+---------------------------+--------------------------+
|DeptID|DeptName|EmpID|EmpName                    |EmpAge                    |
+------+--------+-----+---------------------------+--------------------------+
|1     |Admin   |1    |John                       |32                        |
|2     |HR      |2    |David                      |45                        |
|3     |Finance |4    |Unknown - No Name to Lookup|Unknown - No Age to Lookup|
+------+--------+-----+---------------------------+--------------------------+

In the above scenario, is it advisable to use broadcast, or is it overkill? Please advise.

Alper t. Turker · Accepted Answer · 2018-04-18T16:41:11

When used like this, broadcasting has no value at all.

When you call:

lookup(broadCastLookupMap.value)($"EmpID")

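broadCastLookupMap.value is evaluated locally, on the driver, before lookup runs (Scala evaluates method arguments eagerly, per its substitution model). The UDF closure therefore captures the plain Map, not the broadcast handle.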
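
The evaluation-order point can be seen without Spark. Below is a minimal plain-Scala sketch; FakeBroadcast is a hypothetical stand-in for Spark's Broadcast that only counts how often .value is read, and lookupEager / lookupDeferred are illustrative names, not Spark APIs.

```scala
object EvalOrderDemo {
  // Hypothetical stand-in for Spark's Broadcast: records every read of .value.
  final class FakeBroadcast[T](payload: T) {
    var reads: Int = 0
    def value: T = { reads += 1; payload }
  }

  // Strict parameter: the caller must produce the Map before the closure
  // exists, so the closure captures the plain Map.
  def lookupEager(m: Map[Int, String]): Int => Option[String] =
    (id: Int) => m.get(id)

  // Handle parameter: the closure captures the handle and reads .value
  // each time it is invoked.
  def lookupDeferred(b: FakeBroadcast[Map[Int, String]]): Int => Option[String] =
    (id: Int) => b.value.get(id)

  val bc = new FakeBroadcast(Map(1 -> "John", 2 -> "David"))

  val eager = lookupEager(bc.value)      // .value is read here, at the call site
  val readsAfterEagerBuild = bc.reads
  eager(1); eager(4)
  val readsAfterEagerCalls = bc.reads    // unchanged: the closure never touches bc

  val deferred = lookupDeferred(bc)      // no read happens here
  deferred(1); deferred(4)
  val readsAfterDeferredCalls = bc.reads // grows by one per invocation
}
```

In the eager variant the handle is out of the picture once the function has been built, which mirrors why passing broadCastLookupMap.value into the UDF factory gives the same closure as passing the plain map.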

A correct implementation would be:

def lookup(lookupMap: Broadcast[Map[Int, EmpDim]]) = udf[Option[EmpDim],Int](
  (empID:Int) => lookupMap.value.lift(empID)
)

and called:

lookup(broadCastLookupMap)($"EmpID")

which might have some positive impact, depending on the actual execution plan. The same rules apply in local and non-local mode:

If data is reused between stages (explicitly or implicitly), broadcasting can be useful.
If data is used only once in the pipeline, the standard closure / argument-passing mechanism is enough.

Nothing here suggests the first case, so the broadcast should be unnecessary. If you want to be sure, test both solutions in a live environment and compare the results.

Passing the argument by name should work too:

def lookup(lookupMap: => Map[Int,EmpDim]) = udf[Option[EmpDim],Int](
  (empID:Int) => lookupMap.lift(empID)
)
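
As a rough illustration of why a by-name parameter defers the work, here is a small plain-Scala sketch with no Spark involved. The names expensiveMap, lookupByName and the evaluations counter are hypothetical and exist only to make the evaluation count observable.

```scala
object ByNameDemo {
  var evaluations = 0

  // Stands in for an expression such as broadCastLookupMap.value.
  def expensiveMap(): Map[Int, String] = {
    evaluations += 1
    Map(1 -> "John", 2 -> "David")
  }

  // By-name parameter: the argument expression is not evaluated at the call
  // site; it is re-evaluated each time m is referenced inside the closure.
  def lookupByName(m: => Map[Int, String]): Int => Option[String] =
    (id: Int) => m.get(id)

  val f = lookupByName(expensiveMap())
  val afterBuild = evaluations // nothing evaluated yet

  val hit = f(1)
  val miss = f(4)
  val afterCalls = evaluations // one evaluation per invocation of f
}
```

Under this reading, the closure holds the unevaluated expression rather than its result, so in the Spark case .value would be resolved where the UDF runs. That matches the answer's "should work"; it is worth confirming on a real job.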
