
I have a partitioned DataFrame, say df1. From df1 I will create df2 and df3:

 from pyspark.sql.functions import concat, sum

 df1 = df1.withColumn("key", concat("col1", "col2", "col3"))
 df1 = df1.repartition(400, "key")

 df2 = df1.groupBy("col1", "col2").agg(sum("colx"))
 df3 = df1.join(df2, ["col1", "col2"])

I want to know: will df3 retain the same partitioning as df1, or do I need to repartition df3 again?


1 Answer


The partitioning of df3 will be completely different from df1's. The repartition on "key" is not preserved: the join on ["col1", "col2"] forces a shuffle on those columns, so df3 comes out hash-partitioned on (col1, col2), not on "key". Likewise, the groupBy that produces df2 shuffles into spark.sql.shuffle.partitions (default: 200) partitions, not 400, and df3 will (probably) end up with that partition count too. The one exception: if df2 is small enough for Spark to broadcast it, no shuffle of df1 happens and df1's partitioning is carried through to df3. If you need df3 partitioned on "key", repartition it explicitly.