
I've read the Spark docs and other related Q&As on SO, but I am still unclear about some details of Spark broadcast variables, especially the following statement:

Spark actions are executed through a set of stages, separated by distributed “shuffle” operations. Spark automatically broadcasts the common data needed by tasks within each stage. The data broadcasted this way is cached in serialized form and deserialized before running each task. This means that explicitly creating broadcast variables is only useful when tasks across multiple stages need the same data or when caching the data in deserialized form is important.

  1. What is "common data"?
  2. If the variable is only used in one stage, does that mean broadcasting it is not useful, regardless of its memory footprint?
  3. Since a broadcast effectively "references" the variable on each executor instead of copying it multiple times, in what scenario is broadcasting a BAD idea? I mean, why isn't this broadcasting behavior the default Spark behavior?

Thank you!

1 Answer


Your question has almost all the answers you need.

What is "common data"?

The data that is referred to/read by tasks on multiple executors, for example a dictionary lookup. Assume you have 100 executors running tasks that all need some huge dictionary. Without a broadcast variable, that dictionary would be re-shipped and loaded with the tasks again and again. With a broadcast variable, it is shipped to each executor only once, and all tasks on that executor share the same cached copy. Hence you save a lot of memory and network traffic.

For more detail: https://blog.knoldus.com/2016/04/30/broadcast-variables-in-spark-how-and-when-to-use-them/
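As a minimal sketch of that pattern (all names made up; assuming the Scala RDD API, with a local master just for illustration), every task resolves codes through the single executor-local copy of the dictionary:

    import org.apache.spark.sql.SparkSession

    object BroadcastLookup {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .master("local[*]") // local run for illustration only
          .appName("broadcast-lookup")
          .getOrCreate()
        val sc = spark.sparkContext

        // Hypothetical lookup table; in a real job this could be a huge
        // dictionary loaded on the driver from a file or a database.
        val countryNames = Map("US" -> "United States", "IN" -> "India", "DE" -> "Germany")

        // Ship the map to each executor once; every task on that executor
        // reads the same cached copy through bcNames.value.
        val bcNames = sc.broadcast(countryNames)

        val codes = sc.parallelize(Seq("US", "DE", "IN", "US"))
        val resolved = codes.map(code => bcNames.value.getOrElse(code, "unknown"))

        resolved.collect().foreach(println)

        bcNames.destroy() // release driver- and executor-side copies when done
        spark.stop()
      }
    }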

If the variable is only used in one stage, does that mean broadcasting it is not useful, regardless of its memory footprint?

No and yes. No, if your single stage has hundreds to thousands of executors! Yes, if your stage has very few executors. Even within a single stage, broadcasting a small lookup table can replace a shuffle join with a plain map, as sketched below.
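Here is a sketch of that map-side join idea (hypothetical tables, Scala RDD API): broadcasting the small side keeps the whole job in one stage instead of shuffling both sides.

    import org.apache.spark.sql.SparkSession

    object MapSideJoin {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .master("local[*]")
          .appName("map-side-join")
          .getOrCreate()
        val sc = spark.sparkContext

        // Small dimension table, assumed to fit comfortably in memory.
        val categories: Map[Int, String] = Map(1 -> "electronics", 2 -> "books")
        val bcCategories = sc.broadcast(categories)

        // Large fact data: (categoryId, amount) pairs.
        val sales = sc.parallelize(Seq((1, 10.0), (2, 5.0), (1, 7.5)))

        // Joining through the broadcast map stays in a single stage -- no
        // shuffle, unlike an RDD-to-RDD join, which repartitions both sides.
        val joined = sales.map { case (id, amount) =>
          (bcCategories.value.getOrElse(id, "unknown"), amount)
        }

        joined.collect().foreach(println)
        spark.stop()
      }
    }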

Since a broadcast effectively "references" the variable on each executor instead of copying it multiple times, in what scenario is broadcasting a BAD idea? I mean, why isn't this broadcasting behavior the default Spark behavior?

The data broadcast this way is cached in serialized form and deserialized before running each task. So if the data being broadcast is very large, serialization and deserialization become costly operations, and the whole value must also fit in memory on the driver and on every executor that uses it. In such cases you should avoid broadcast variables.
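To make the tradeoff concrete, a sketch (hypothetical sizes, Scala RDD API): tiny values are cheapest left to ordinary closure capture, which is one reason broadcasting is not the default, while large values justify the broadcast machinery only when they fit in memory everywhere.

    import org.apache.spark.sql.SparkSession

    object BroadcastTradeoff {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .master("local[*]")
          .appName("broadcast-tradeoff")
          .getOrCreate()
        val sc = spark.sparkContext
        val numbers = sc.parallelize(1 to 1000)

        // Tiny value: just let the closure capture it. Broadcasting it would
        // add bookkeeping (driver copy, executor cache, cleanup) for no gain --
        // one reason Spark does not turn every captured variable into a broadcast.
        val offset = 42
        val shifted = numbers.map(_ + offset)

        // Large value: broadcasting pays off, but the whole structure must be
        // serialized on the driver and fit in memory on every executor using it.
        val bigTable: Map[Int, String] = (1 to 100000).map(i => i -> s"row-$i").toMap
        val bcTable = sc.broadcast(bigTable)
        val labeled = numbers.map(i => bcTable.value.getOrElse(i, "missing"))

        println((shifted.count(), labeled.count()))

        bcTable.unpersist() // drop executor-side copies; destroy() removes it for good
        spark.stop()
      }
    }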