EDIT: adding more context to the question after rereading the post:
Say I have a PySpark DataFrame that I am working with, and I can currently repartition it like this:
dataframe.repartition(200, col_name)
I then write that partitioned DataFrame out to Parquet. When I inspect the output directory, I see that the warehouse directory is partitioned the way I want:
/apps/hive/warehouse/db/DATE/col_name=1
/apps/hive/warehouse/db/DATE/col_name=2
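For reference, here is a minimal sketch of the kind of write that produces that layout (the source table name is illustrative; as far as I understand, it is the partitionBy() on the writer, not repartition() alone, that creates the col_name=... directories):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical source table; any DataFrame works the same way.
dataframe = spark.read.table("db.some_table")

# repartition(200, "col_name") shuffles the rows into 200 in-memory partitions
# keyed on col_name; partitionBy("col_name") on the writer is what actually
# lays the files out into the col_name=... directories on disk.
(dataframe
    .repartition(200, "col_name")
    .write
    .mode("overwrite")
    .partitionBy("col_name")
    .parquet("/apps/hive/warehouse/db/DATE"))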
I want to understand how I can partition this in multiple layers: one column for the top-level partition, a second column for the second-level partition, and a third column for the third-level partition. Is it as easy as adding a partitionBy() to the write method?
dataframe.write.mode("overwrite").partitionBy("col_name1", "col_name2", "col_name3").parquet("/apps/hive/warehouse/db/DATE")
Thus creating directories like this?
/apps/hive/warehouse/db/DATE/col_name1=1
└── col_name2=1
    └── col_name3=1
If so, can I also use partitionBy() to cap the maximum number of files written per partition?
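From what I have read so far (and I may be wrong, which is partly why I am asking), partitionBy() only controls the directory layout; the number of files per directory comes from how the data is partitioned in memory, plus the maxRecordsPerFile writer option. A sketch of what I am considering, with illustrative table name and record cap:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
dataframe = spark.read.table("db.some_table")  # hypothetical source

# Repartitioning on the same columns used in partitionBy() puts all rows of a
# given (col_name1, col_name2, col_name3) key into a single task, so each leaf
# directory receives one file. maxRecordsPerFile (available since Spark 2.2)
# then splits a file once it exceeds the given row count.
(dataframe
    .repartition("col_name1", "col_name2", "col_name3")
    .write
    .mode("overwrite")
    .option("maxRecordsPerFile", 1000000)  # illustrative cap
    .partitionBy("col_name1", "col_name2", "col_name3")
    .parquet("/apps/hive/warehouse/db/DATE"))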