I'm using Spark 2.2.1.
I have a small DataFrame (less than 1M) and a computation on a big DataFrame that needs this small one to compute a column in a UDF.
What is the best option in terms of performance?
Is it better to broadcast this small DF (I don't know whether Spark will compute the Cartesian product in memory)?
bigDF.crossJoin(broadcast(smallDF))
  .withColumn("newCol", combineUdf($"colFromSmall", $"colFromBig"))
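For reference, here is a fuller, self-contained sketch of what I mean by the first option (the UDF body, the column names, the output column name and the SparkSession name spark are just placeholders):

import org.apache.spark.sql.functions.{broadcast, udf}
import spark.implicits._  // assuming the SparkSession is named spark

// placeholder UDF that combines one value from each DataFrame
val combineUdf = udf { (fromSmall: String, fromBig: String) => fromSmall + "_" + fromBig }

// broadcast(smallDF) hints Spark to ship the small side to every executor,
// so the cross join is executed as a broadcast nested loop join
val result1 = bigDF.crossJoin(broadcast(smallDF))
  .withColumn("newCol", combineUdf($"colFromSmall", $"colFromBig"))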
Or is it better to collect it and use the small value directly in the UDF?
val small = smallDF.collect()
bigDF.withColumn("newCol", lookupUdf($"colFromBig"))
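And here is a fuller sketch of the second option, where the UDF closes over the collected values (here I collect just the column I need as a typed array instead of raw Rows; the lookup logic, column names and types are placeholders):

import org.apache.spark.sql.functions.udf
import spark.implicits._  // assuming the SparkSession is named spark

// collect the small DataFrame to the driver as a plain Scala collection
val small: Array[String] = smallDF.select("colFromSmall").as[String].collect()

// the UDF closes over `small`, so Spark serializes it into each task's closure
val lookupUdf = udf { (fromBig: String) => if (small.contains(fromBig)) "match" else "no match" }

val result2 = bigDF.withColumn("newCol", lookupUdf($"colFromBig"))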