1
votes

When i am trying to understand the difference between coalesce() and repartition(), I understood that coalesce can only reduce number of partitions of dataframe and if we try to increase the number of partitions then no of partitions remain unchanged. As per the https://stackoverflow.com/a/45854701/1784552 coalesce is used only to decrease number of partitions.

But when i tried to execute below code, I observed two things

  1. For Dataframe with coalesce number of partitions can be increased
  2. For Rdd if shuffle = false then number of partitions cannot be increase with coalesce.

Does it mean that with coalesce dataframe partitions can be increased?

    val h1b1Df = spark.read.csv("/FileStore/tables/h1b_data.csv")
    println("Original dataframe partitions = "+h1b1Df.rdd.getNumPartitions)

    val cloasedDf = h1b1Df.coalesce(2)
    println("Coalesced dataframe partitions = "+cloasedDf.rdd.getNumPartitions

    val cloasedDf1 = cloasedDf.coalesce(6) 
    println("Coalesced dataframe with increased partitions = "+cloasedDf1.rdd.getNumPartitions) 

// out put is

Original dataframe partitions = 8

Coalesced dataframe partitions = 2

Coalesced dataframe with increased partitions = 6

val inpRdd = h1b1Df.rdd
println("Original rdd partitions = "+inpRdd.getNumPartitions)

val colasedRdd = inpRdd.coalesce(4)
println("Coalesced rdd partitions = "+colasedRdd.getNumPartitions)

val colasedRdd1 = colasedRdd.coalesce(6,false)
println("Coalesced rdd with increased partitions = "+colasedRdd1.getNumPartitions)

// Output

Original rdd partitions = 8

Coalesced rdd partitions = 4

Coalesced rdd with increased partitions = 4

2
I would focus on learning useful things and use the software as intended. This is all going nowhere knowledge imho.thebluephantom
Correct understanding of api would help write better code. Dont demotivateNiketa
It’s called advice. You are free to ignore it.thebluephantom

2 Answers

0
votes

Coalesce can be used to increase partitions by setting shuffle=true which is equal to repartition. When you use coalesce with shuffle=false to increase, data movement wont happen. So one partition data cant be moved to another partition. Whereas while reduce it just merges the nearest partitions.

Thanks,

0
votes

Coalesce for dataframe cannot increase partitions greater than total number of cores in the cluster.

 val h1b1Df = spark.read.csv("/FileStore/tables/h1b_data.csv")
 h1b1Df.rdd.getNumPartitions        // prints 8

 val cloasedDf = h1b1Df.coalesce(21)  
 cloasedDf.rdd.getNumPartitions     // prints 8

 val cloasedDf1 = cloasedDf.coalesce(2) // prints 2
 cloasedDf1.rdd.getNumPartitions

 val cloasedDf2 = cloasedDf.coalesce(7) // prints 7
 cloasedDf2.rdd.getNumPartitions