I know the usual routine: `sc.broadcast(x)`.
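For concreteness, this is a toy version of that routine (the app name, lookup map, and values are just placeholders):

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.broadcast.Broadcast;

public class BroadcastRoutine {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(
                new SparkConf().setAppName("broadcast-demo").setMaster("local[2]"));

        Map<String, String> lookup = new HashMap<>();
        lookup.put("a", "alpha");

        // The usual routine: ship one read-only copy to every executor.
        Broadcast<Map<String, String>> bc = sc.broadcast(lookup);

        // Tasks read the shared value via bc.value() instead of
        // serializing the map into every closure.
        sc.parallelize(Arrays.asList("a", "b"))
          .map(k -> bc.value().getOrDefault(k, "?"))
          .collect()
          .forEach(System.out::println);

        sc.stop();
    }
}
```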
However, Spark Streaming currently does not support broadcast variables together with checkpointing.
The official guide provides a workaround: http://spark.apache.org/docs/latest/streaming-programming-guide.html#accumulators-and-broadcast-variables. However, that solution can only be used inside `foreachRDD` functions.
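As far as I understand it, the guide's workaround is a lazily instantiated broadcast singleton that is recreated inside `foreachRDD`. Sketched from the Java example there (my class and variable names, details may differ):

```java
import java.util.Arrays;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.broadcast.Broadcast;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class ForeachRddBroadcast {

    // Lazily instantiated broadcast singleton. After recovery from a
    // checkpoint the static field is null again, so the value is simply
    // re-broadcast instead of being restored from the checkpoint.
    static class Blacklist {
        private static volatile Broadcast<List<String>> instance = null;

        static Broadcast<List<String>> getInstance(JavaSparkContext jsc) {
            if (instance == null) {
                synchronized (Blacklist.class) {
                    if (instance == null) {
                        instance = jsc.broadcast(Arrays.asList("a", "b", "c"));
                    }
                }
            }
            return instance;
        }
    }

    public static void main(String[] args) throws InterruptedException {
        JavaStreamingContext jssc = new JavaStreamingContext(
                new SparkConf().setAppName("demo").setMaster("local[2]"),
                Durations.seconds(1));
        JavaDStream<String> words = jssc.socketTextStream("localhost", 9999);

        // foreachRDD hands us an RDD, whose context we can use to broadcast:
        words.foreachRDD(rdd -> {
            Broadcast<List<String>> blacklist =
                    Blacklist.getInstance(new JavaSparkContext(rdd.context()));
            rdd.filter(w -> !blacklist.value().contains(w))
               .collect()
               .forEach(System.out::println);
        });

        jssc.start();
        jssc.awaitTermination();
    }
}
```

This survives checkpointing precisely because the broadcast is recreated on demand, but it relies on `foreachRDD` handing us an RDD to get a context from.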
Now I want to use large or unserializable variables (such as a `KafkaProducer`) that need to be broadcast in this way inside mapping functions (such as `flatMapToPair`). But since no RDD variable is visible there, I cannot get hold of the Spark context to broadcast the lazily evaluated variable. And if I use the initial context that created the DStreams, or the context retrieved from a DStream, the task becomes non-serializable.
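To make the problem concrete, here is a stripped-down sketch of the failing version (assuming Spark 2.x's iterator-based `PairFlatMapFunction`; the socket source and the broadcast value are placeholders):

```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.broadcast.Broadcast;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

import scala.Tuple2;

public class TaskNotSerializableDemo {
    public static void main(String[] args) throws InterruptedException {
        JavaStreamingContext jssc = new JavaStreamingContext(
                new SparkConf().setAppName("demo").setMaster("local[2]"),
                Durations.seconds(1));
        JavaDStream<String> lines = jssc.socketTextStream("localhost", 9999);

        // Referencing jssc (or any SparkContext) inside the lambda pulls
        // the context into the closure. Contexts are not serializable, so
        // Spark fails with "Task not serializable".
        lines.flatMapToPair(line -> {
            Broadcast<String> bc = jssc.sparkContext().broadcast("value");
            return Arrays.asList(new Tuple2<>(line, bc.value())).iterator();
        }).print();

        jssc.start();
        jssc.awaitTermination();
    }
}
```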
So how can I use broadcast variables in mapping functions? Or is there any workaround for using large or unserializable variables in mapping functions?