I know the usual routine: `sc.broadcast(x)`.
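For concreteness, this is a toy version of that routine (the app name, lookup map, and values are just placeholders):

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.broadcast.Broadcast;

public class BroadcastRoutine {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(
                new SparkConf().setAppName("broadcast-demo").setMaster("local[2]"));

        Map<String, String> lookup = new HashMap<>();
        lookup.put("a", "alpha");

        // The usual routine: ship one read-only copy to every executor.
        Broadcast<Map<String, String>> bc = sc.broadcast(lookup);

        // Tasks read the shared value via bc.value() instead of
        // serializing the map into every closure.
        sc.parallelize(Arrays.asList("a", "b"))
          .map(k -> bc.value().getOrDefault(k, "?"))
          .collect()
          .forEach(System.out::println);

        sc.stop();
    }
}
```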
However, Spark Streaming currently does not support broadcast variables together with checkpointing.
The official guide provides a workaround: http://spark.apache.org/docs/latest/streaming-programming-guide.html#accumulators-and-broadcast-variables. However, that solution can only be used inside `foreachRDD` functions.
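As far as I understand it, the guide's workaround is a lazily instantiated broadcast singleton that is recreated inside `foreachRDD`. Sketched from the Java example there (my class and variable names, details may differ):

```java
import java.util.Arrays;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.broadcast.Broadcast;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class ForeachRddBroadcast {

    // Lazily instantiated broadcast singleton. After recovery from a
    // checkpoint the static field is null again, so the value is simply
    // re-broadcast instead of being restored from the checkpoint.
    static class Blacklist {
        private static volatile Broadcast<List<String>> instance = null;

        static Broadcast<List<String>> getInstance(JavaSparkContext jsc) {
            if (instance == null) {
                synchronized (Blacklist.class) {
                    if (instance == null) {
                        instance = jsc.broadcast(Arrays.asList("a", "b", "c"));
                    }
                }
            }
            return instance;
        }
    }

    public static void main(String[] args) throws InterruptedException {
        JavaStreamingContext jssc = new JavaStreamingContext(
                new SparkConf().setAppName("demo").setMaster("local[2]"),
                Durations.seconds(1));
        JavaDStream<String> words = jssc.socketTextStream("localhost", 9999);

        // foreachRDD hands us an RDD, whose context we can use to broadcast:
        words.foreachRDD(rdd -> {
            Broadcast<List<String>> blacklist =
                    Blacklist.getInstance(new JavaSparkContext(rdd.context()));
            rdd.filter(w -> !blacklist.value().contains(w))
               .collect()
               .forEach(System.out::println);
        });

        jssc.start();
        jssc.awaitTermination();
    }
}
```

This survives checkpointing precisely because the broadcast is recreated on demand, but it relies on `foreachRDD` handing us an RDD to get a context from.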
Now I want to use large or unserializable variables (such as a `KafkaProducer`) that need to be broadcast in this way inside mapping functions (such as `flatMapToPair`). But since no RDD variable is visible there, I cannot get hold of the Spark context to broadcast the lazily evaluated variable. And if I use the initial context that created the DStreams, or the context retrieved from a DStream, the task becomes non-serializable.
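To make the problem concrete, here is a stripped-down sketch of the failing version (assuming Spark 2.x's iterator-based `PairFlatMapFunction`; the socket source and the broadcast value are placeholders):

```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.broadcast.Broadcast;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

import scala.Tuple2;

public class TaskNotSerializableDemo {
    public static void main(String[] args) throws InterruptedException {
        JavaStreamingContext jssc = new JavaStreamingContext(
                new SparkConf().setAppName("demo").setMaster("local[2]"),
                Durations.seconds(1));
        JavaDStream<String> lines = jssc.socketTextStream("localhost", 9999);

        // Referencing jssc (or any SparkContext) inside the lambda pulls
        // the context into the closure. Contexts are not serializable, so
        // Spark fails with "Task not serializable".
        lines.flatMapToPair(line -> {
            Broadcast<String> bc = jssc.sparkContext().broadcast("value");
            return Arrays.asList(new Tuple2<>(line, bc.value())).iterator();
        }).print();

        jssc.start();
        jssc.awaitTermination();
    }
}
```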
So how can I use broadcast variables in mapping functions? Or is there any workaround for using large or unserializable variables in mapping functions?