I've read the Spark documentation and other related Q&As on SO, but I am still unclear about some details of Spark broadcast variables, especially the following passage from the docs:
> Spark actions are executed through a set of stages, separated by distributed “shuffle” operations. Spark automatically broadcasts the common data needed by tasks within each stage. The data broadcasted this way is cached in serialized form and deserialized before running each task. This means that explicitly creating broadcast variables is only useful when tasks across multiple stages need the same data or when caching the data in deserialized form is important.
- What is "common data"?
- If the variable is only used in one stage, does that mean broadcasting it is not useful, regardless of its memory footprint?
- Since a broadcast effectively "references" the variable on each executor instead of copying it multiple times, in what scenario is broadcasting a BAD idea? In other words, why is this broadcasting behavior not Spark's default?
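To make my reading of the quoted paragraph concrete, here is a toy model in plain Python. It does not call Spark at all. The function `simulate`, its parameters, and the lookup table are hypothetical names I made up for illustration, and the counting rules are only my interpretation of the paragraph (automatic broadcast: fetched once per executor per stage and deserialized before each task; explicit broadcast: fetched and deserialized once per executor), so they may well be wrong.

```python
import pickle


def simulate(num_executors, num_stages, tasks_per_executor_per_stage, explicit_broadcast):
    """Count transfers and deserializations of one shared lookup table.

    Toy model only: it encodes one reading of the quoted paragraph
    and is not how Spark is actually implemented.
    """
    table = {i: str(i) for i in range(1000)}  # hypothetical shared data
    payload = pickle.dumps(table)             # serialized form sent to executors
    transfers = 0
    deserializations = 0

    for _executor in range(num_executors):
        if explicit_broadcast:
            # Assumed: fetched once per executor, kept deserialized,
            # and reused by every task in every stage.
            transfers += 1
            pickle.loads(payload)
            deserializations += 1
        else:
            for _stage in range(num_stages):
                # Assumed: fetched once per executor per stage and cached serialized.
                transfers += 1
                for _task in range(tasks_per_executor_per_stage):
                    # Assumed: deserialized again before each task runs.
                    pickle.loads(payload)
                    deserializations += 1

    return {"transfers": transfers, "deserializations": deserializations}


implicit = simulate(4, 2, 8, explicit_broadcast=False)
explicit = simulate(4, 2, 8, explicit_broadcast=True)
print("automatic:", implicit)
print("explicit: ", explicit)
```

If this model is right, then with a single stage (`num_stages=1`) the transfer counts are identical and only the number of deserializations differs, which is what prompts my second question.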
Thank you!