I have a question about Spark broadcast joins. By default, the broadcast hash join threshold (spark.sql.autoBroadcastJoinThreshold) is 10 MB.
Case 1: the cluster has enough memory to hold the broadcast DataFrame.
If the DataFrame is larger than the default threshold, say 15 MB, and I explicitly broadcast it across all the nodes in the cluster, will Spark still perform a broadcast join? Or, since 15 MB exceeds the default threshold, will it fall back to some other join strategy even though I broadcast the DataFrame?
Case 2: the cluster does not have enough memory to hold the broadcast DataFrame.
Suppose I have a 15 MB DataFrame (a hypothetical size) and I broadcast it during the join, but one or more nodes lack the memory to hold it. Will the job fail with an out-of-memory error, or will Spark spill the data to disk?