
I'm using Spark 2.2.1.

I have a small DataFrame (under 1M rows) and a computation on a big DataFrame that needs the small one to compute a column in a UDF.

What is the best option performance-wise?

Is it better to broadcast this DataFrame (I don't know whether Spark will materialize the Cartesian product in memory):

bigDF.crossJoin(broadcast(smallDF))
     .withColumn("result", udf($"colFromSmall", $"colFromBig"))

or to collect it and use the small value directly in the UDF:

val small = smallDF.collect()
bigDF.withColumn("result", udf($"colFromBig"))
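For context, a fuller sketch of this second approach might look like the following (the column names `key`, `colFromSmall`, `colFromBig` and the output column `enriched` are hypothetical, and `udf` is assumed to wrap an ordinary Scala function):

```scala
import org.apache.spark.sql.functions.udf

// Collect the small DataFrame to the driver and turn it into a lookup map.
// Assumed schema: smallDF has columns (key: String, colFromSmall: Int).
val lookup: Map[String, Int] = smallDF.collect()
  .map(r => r.getAs[String]("key") -> r.getAs[Int]("colFromSmall"))
  .toMap

// The map is captured in the udf's closure and re-serialized with every task.
val enrich = udf((colFromBig: String) => lookup.getOrElse(colFromBig, 0))

bigDF.withColumn("enriched", enrich($"colFromBig"))
```

The catch with closure capture is that the map ships with each task rather than being cached once per executor, which is what the answer below addresses.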
1 Answer


Both approaches collect the data to the driver first, so in terms of memory footprint there is no difference. The choice should therefore be dictated by the logic:

  • If you can do better than the default execution plan and don't want to create your own, a udf might be a better approach.
  • If it is just a Cartesian product, and requires a subsequent explode (perish the thought), just go with the former option.

As suggested in the comments by T.Gawęda, in the second case you can use a broadcast variable:

val small = spark.sparkContext.broadcast(smallDF.collect())
bigDF.withColumn("result", udf($"colFromBig"))

It might provide some performance improvement if the udf is reused, since the collected data is cached once per executor instead of being shipped with every task closure.
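Concretely, the broadcast handle is read inside the udf via `.value`. A sketch, assuming the same hypothetical schema as in the question (`key`/`colFromSmall` on the small side, `colFromBig` on the big side):

```scala
import org.apache.spark.sql.functions.udf

// Broadcast the collected rows once; each executor caches a single copy.
val small = spark.sparkContext.broadcast(
  smallDF.collect()
    .map(r => r.getAs[String]("key") -> r.getAs[Int]("colFromSmall"))
    .toMap
)

// Inside the udf, dereference the broadcast value with .value.
val enrich = udf((colFromBig: String) => small.value.getOrElse(colFromBig, 0))

bigDF.withColumn("enriched", enrich($"colFromBig"))
```

Broadcasting a `Map` rather than the raw `Array[Row]` also makes the per-row lookup O(1) instead of a scan.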