While trying to understand the difference between coalesce() and repartition(), I learned that coalesce can only reduce the number of partitions of a DataFrame, and that if we try to increase the number of partitions, the number of partitions remains unchanged. According to https://stackoverflow.com/a/45854701/1784552, coalesce is used only to decrease the number of partitions.
But when I ran the code below, I observed two things:
- For a DataFrame, the number of partitions can be increased with coalesce.
- For an RDD, if shuffle = false, the number of partitions cannot be increased with coalesce.
Does this mean that coalesce can increase the number of partitions of a DataFrame?
// Read the CSV and check how many partitions the DataFrame starts with
val h1b1Df = spark.read.csv("/FileStore/tables/h1b_data.csv")
println("Original dataframe partitions = " + h1b1Df.rdd.getNumPartitions)
// Reduce to 2 partitions
val coalescedDf = h1b1Df.coalesce(2)
println("Coalesced dataframe partitions = " + coalescedDf.rdd.getNumPartitions)
// Try to increase to 6 partitions with coalesce
val coalescedDf1 = coalescedDf.coalesce(6)
println("Coalesced dataframe with increased partitions = " + coalescedDf1.rdd.getNumPartitions)
// Output
Original dataframe partitions = 8
Coalesced dataframe partitions = 2
Coalesced dataframe with increased partitions = 6
// Same experiment on the underlying RDD
val inpRdd = h1b1Df.rdd
println("Original rdd partitions = " + inpRdd.getNumPartitions)
// Reduce to 4 partitions
val coalescedRdd = inpRdd.coalesce(4)
println("Coalesced rdd partitions = " + coalescedRdd.getNumPartitions)
// Try to increase to 6 partitions without a shuffle
val coalescedRdd1 = coalescedRdd.coalesce(6, shuffle = false)
println("Coalesced rdd with increased partitions = " + coalescedRdd1.getNumPartitions)
// Output
Original rdd partitions = 8
Coalesced rdd partitions = 4
Coalesced rdd with increased partitions = 4
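
For anyone who wants to reproduce this without my CSV file, here is a minimal sketch of the same comparison. It rests on a few assumptions that are not part of my original run: a local SparkSession, `spark.range` with 8 partitions as a stand-in for the CSV data, and the hypothetical names `CoalesceRepro`, `chainedDf` and `chainedRdd`. It also calls `explain(true)` on the DataFrame, which might help show what plan Spark actually builds for the two chained coalesce calls.

```scala
import org.apache.spark.sql.SparkSession

object CoalesceRepro {
  def main(args: Array[String]): Unit = {
    // Assumption: a local session; in a notebook the existing `spark` can be used instead.
    val spark = SparkSession.builder()
      .appName("coalesce-repro")
      .master("local[*]")
      .getOrCreate()

    // Stand-in for the CSV: 1000 rows explicitly split into 8 partitions.
    val df = spark.range(0L, 1000L, 1L, 8).toDF("id")
    println("Original dataframe partitions = " + df.rdd.getNumPartitions)

    // DataFrame: coalesce down to 2, then ask for 6.
    val chainedDf = df.coalesce(2).coalesce(6)
    println("Dataframe partitions after coalesce(2).coalesce(6) = " + chainedDf.rdd.getNumPartitions)

    // Print the parsed, analyzed, optimized and physical plans for inspection.
    chainedDf.explain(true)

    // RDD: coalesce down to 4, then ask for 6 without a shuffle.
    val chainedRdd = df.rdd.coalesce(4).coalesce(6, shuffle = false)
    println("Rdd partitions after coalesce(4).coalesce(6, false) = " + chainedRdd.getNumPartitions)

    spark.stop()
  }
}
```

Comparing the optimized plan printed by `explain(true)` with the two printed partition counts may make it clearer whether the DataFrame and the RDD are really doing the same thing here.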