1
votes

I am using a broadcast variable for a join operation in Spark, but the time it takes to ship the broadcast from the driver to the executors is a problem. I would like to load it once and reuse it across multiple jobs for the whole application lifecycle.

My reference: https://github.com/apache/spark/blob/branch-2.2/core/src/test/scala/org/apache/spark/broadcast/BroadcastSuite.scala

3 Answers

1
votes

Broadcast variables are tied to a session/context, not to an individual job. If you reuse the same SparkSession, it's likely that the broadcast variable will be reused as well. If I recall correctly, under certain types of memory pressure the workers may evict the broadcast variable, but if it is still referenced it will be re-broadcast automatically to satisfy the reference.
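
To illustrate, here is a minimal sketch of one broadcast handle being used by two separate jobs in the same context. The class name `BroadcastReuse`, the method `runTwoJobs`, and the tiny lookup map are made up for the example, and it assumes Spark 2.x with Java 8 running in local mode; it is not taken from the question's code.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.broadcast.Broadcast;

public class BroadcastReuse {

    // Hypothetical helper: runs two actions (two jobs) against one broadcast handle.
    public static List<List<String>> runTwoJobs(JavaSparkContext sc) {
        Map<Integer, String> lookup = new HashMap<>();
        lookup.put(1, "one");
        lookup.put(2, "two");

        // Broadcast once; the same handle is referenced by both jobs below.
        final Broadcast<Map<Integer, String>> lookupBc = sc.broadcast(lookup);

        JavaRDD<Integer> first = sc.parallelize(Arrays.asList(1, 2));
        JavaRDD<Integer> second = sc.parallelize(Arrays.asList(2, 2, 1));

        // Job 1: map-side lookup through the broadcast value.
        List<String> a = first.map(k -> lookupBc.value().get(k)).collect();
        // Job 2: a different RDD and action, same broadcast handle.
        List<String> b = second.map(k -> lookupBc.value().get(k)).collect();

        return Arrays.asList(a, b);
    }

    public static void main(String[] args) {
        // Local master is an assumption so the sketch runs without a cluster.
        SparkConf conf = new SparkConf().setAppName("broadcast-reuse").setMaster("local[2]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            System.out.println(runTwoJobs(sc));
        }
    }
}
```

As long as the handle is kept around and not unpersisted or destroyed, later jobs in the same application can keep calling `lookupBc.value()`.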

0
votes

Broadcast variables can be used to cache a value in memory on all nodes. They allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks. They can be used, for example, to give every node a copy of a large input dataset in an efficient manner.

// EdhBroadcast is my own wrapper class (not part of Spark); javaSparkContext is the application's JavaSparkContext instance
EdhBroadcast broadcast = new EdhBroadcast(javaSparkContext);

0
votes

It's not possible. Broadcast variables are used to send some immutable state once to each worker. You use them when you want a local copy of a variable.

Instead, you can create an RDD, cache it, and reuse it across jobs.
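
A rough sketch of that approach, assuming Spark 2.x with Java 8. The class `CachedLookupJoin`, the method `joinTwice`, and the sample pairs are invented for illustration only.

```java
import java.util.Arrays;

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

public class CachedLookupJoin {

    // Hypothetical helper: joins two datasets against the same cached lookup RDD.
    public static long[] joinTwice(JavaSparkContext sc) {
        // Lookup side of the join, marked for caching so later jobs can reuse it.
        JavaPairRDD<Integer, String> lookup = sc.parallelizePairs(Arrays.asList(
                new Tuple2<Integer, String>(1, "one"),
                new Tuple2<Integer, String>(2, "two"))).cache();

        JavaPairRDD<Integer, Double> ordersA = sc.parallelizePairs(Arrays.asList(
                new Tuple2<Integer, Double>(1, 9.5),
                new Tuple2<Integer, Double>(2, 3.0)));
        JavaPairRDD<Integer, Double> ordersB = sc.parallelizePairs(Arrays.asList(
                new Tuple2<Integer, Double>(2, 1.25),
                new Tuple2<Integer, Double>(3, 7.0)));

        // First job computes the lookup partitions and stores them in the cache.
        long a = ordersA.join(lookup).count();
        // Second job can read the lookup partitions from the cache.
        long b = ordersB.join(lookup).count();

        return new long[] {a, b};
    }
}
```

Note that this is a regular shuffle join rather than a map-side lookup, so it trades the broadcast transfer for a shuffle on each join.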