I'm trying to find the best way to read partitioned data from Parquet files and write it back while keeping the directory hierarchy in Spark. When I use spark.read.parquet(inputPath), Spark reads all the partitions from the directory hierarchy and represents them as columns, but when I write that DataFrame back I lose the hierarchy. To keep it I have to use .write.partitionBy, which requires specifying the partition columns. Is there a more automatic way to do this?
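Roughly what I mean, sketched with a made-up /data/events layout partitioned by year and month:

// hypothetical layout: /data/events/year=2020/month=1/part-00000.parquet, ...
val df = spark.read.parquet("/data/events")   // year and month show up as regular columns

df.write.parquet("/out/flat")                              // partition hierarchy is lost
df.write.partitionBy("year", "month").parquet("/out/kept") // kept, but the columns are hard-coded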


1 Answer


The partition columns aren't actually stored in the DataFrame's metadata and, as far as I know, you can't get them back from it. Spark discovers the partitions by parsing the file paths when it scans the data folder. You can read more about it here.
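For example, with a hypothetical folder partitioned by year and month, the discovered columns look just like the ones stored in the files:

val df = spark.read.parquet("/data/events")
df.printSchema()
// root
//  |-- id: long (nullable = true)        <- stored in the Parquet files
//  |-- year: integer (nullable = true)   <- discovered from the path, not stored in the files
//  |-- month: integer (nullable = true)  <- discovered from the path, not stored in the files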

So, I think you can get these partition columns back by extracting them from the file paths you are loading. You can list the data folder recursively and then extract the partitions from the paths. Since they have the form column_name=value, you can use some regex, as in the sketch below.
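Here is a minimal sketch of that idea using the Hadoop FileSystem API; the /data/events path is just a placeholder for your input folder:

import org.apache.hadoop.fs.{FileSystem, Path}
import scala.collection.mutable.ArrayBuffer

val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)

// list every file under the input folder, recursively
val files = ArrayBuffer.empty[String]
val it = fs.listFiles(new Path("/data/events"), true)
while (it.hasNext) files += it.next().getPath.toString

// keep the path segments of the form column=value and extract the column names
val partitionRegex = "^(.+)=(.+)$".r
val partitionCols = files
  .flatMap(_.split("/"))
  .collect { case partitionRegex(name, _) => name }
  .distinct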

Another option is to get these paths from the DataFrame itself by selecting input_file_name(). Something like this:

import org.apache.spark.sql.functions.input_file_name

// get all distinct file paths read by the DataFrame
val filePaths = df.select(input_file_name()).distinct.collect().map(_.getString(0))

// path segments of the form column=value are partition directories
val partitionRegex = """^(.+)=(.+)$"""

// extract the distinct partition column names from those segments
val partitionColumns = filePaths.flatMap(_.split("/"))
                                .filter(_ matches partitionRegex)
                                .map(_.split("=")(0))
                                .distinct

// write with partitionBy (it takes column names, not Column objects)
df.write.partitionBy(partitionColumns: _*).parquet("/path")
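After this write, the output under /path should end up with the same column=value hierarchy that the input folder had.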