I am trying to repartition a DataFrame by a column. The DataFrame has N (say N = 3) different values in the partition column x, e.g.:
val myDF = sc.parallelize(Seq(1,1,2,2,3,3)).toDF("x") // create dummy data
What I would like to achieve is to repartition myDF by x without producing empty partitions. Is there a better way than doing this?
val numParts = myDF.select($"x").distinct().count.toInt
myDF.repartition(numParts, $"x")
(If I don't specify numParts in repartition, most of my partitions are empty, as repartition creates 200 partitions by default ...)
Comment: that default of 200 comes from the `spark.sql.shuffle.partitions` setting. – UninformedUser
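A runnable sketch of the two options discussed above, assuming a local SparkSession (the object and app names here are illustrative): the comment's suggestion of lowering `spark.sql.shuffle.partitions`, and the question's approach of counting distinct keys first and passing that count to `repartition`.

```scala
import org.apache.spark.sql.SparkSession

object RepartitionByColumn {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("repartition-by-column") // illustrative name
      // Option from the comment: lower the shuffle default (200) so that
      // repartition($"x") alone no longer produces ~197 empty partitions.
      .config("spark.sql.shuffle.partitions", "3")
      .getOrCreate()
    import spark.implicits._

    val myDF = Seq(1, 1, 2, 2, 3, 3).toDF("x")

    // Option from the question: count the distinct keys first,
    // then request exactly that many partitions.
    val numParts = myDF.select($"x").distinct().count.toInt
    val repartitioned = myDF.repartition(numParts, $"x")

    // Confirm how many partitions the shuffle actually produced.
    println(repartitioned.rdd.getNumPartitions)

    spark.stop()
  }
}
```

One caveat either way: `repartition` hash-partitions the keys, so two distinct values of x can still land in the same partition, leaving another one empty. Setting numParts only bounds the number of partitions; it does not guarantee one non-empty partition per key.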