EDIT: adding more context to the question after rereading the post:
Say I have a PySpark DataFrame that I am working with, and I can currently repartition it like this:
dataframe.repartition(200, col_name)
I then write that partitioned DataFrame out to Parquet. When I inspect the output directory, I see that the warehouse directory is partitioned the way I want:
/apps/hive/warehouse/db/DATE/col_name=1
/apps/hive/warehouse/db/DATE/col_name=2
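For reference, here is a minimal sketch of the kind of write that produces that layout (the source table name is illustrative; as far as I understand, it is the partitionBy() on the writer, not repartition() alone, that creates the col_name=... directories):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical source table; any DataFrame works the same way.
dataframe = spark.read.table("db.some_table")

# repartition(200, "col_name") shuffles the rows into 200 in-memory partitions
# keyed on col_name; partitionBy("col_name") on the writer is what actually
# lays the files out into the col_name=... directories on disk.
(dataframe
    .repartition(200, "col_name")
    .write
    .mode("overwrite")
    .partitionBy("col_name")
    .parquet("/apps/hive/warehouse/db/DATE"))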
I want to understand how I can partition this in multiple layers: one column for the top-level partition, a second column for the second-level partition, and a third column for the third-level partition. Is it as easy as adding a partitionBy() to the write method?
dataframe.write.mode("overwrite").partitionBy("col_name1", "col_name2", "col_name3").parquet("/apps/hive/warehouse/db/DATE")
Thus creating directories like this?
/apps/hive/warehouse/db/DATE/col_name1=1
└── col_name2=1
    └── col_name3=1
If so, can I also use partitionBy() to cap the maximum number of files written per partition?
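From what I have read so far (and I may be wrong, which is partly why I am asking), partitionBy() only controls the directory layout; the number of files per directory comes from how the data is partitioned in memory, plus the maxRecordsPerFile writer option. A sketch of what I am considering, with illustrative table name and record cap:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
dataframe = spark.read.table("db.some_table")  # hypothetical source

# Repartitioning on the same columns used in partitionBy() puts all rows of a
# given (col_name1, col_name2, col_name3) key into a single task, so each leaf
# directory receives one file. maxRecordsPerFile (available since Spark 2.2)
# then splits a file once it exceeds the given row count.
(dataframe
    .repartition("col_name1", "col_name2", "col_name3")
    .write
    .mode("overwrite")
    .option("maxRecordsPerFile", 1000000)  # illustrative cap
    .partitionBy("col_name1", "col_name2", "col_name3")
    .parquet("/apps/hive/warehouse/db/DATE"))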