Does Spark maintain parquet partitioning on read?

Question

I am having a lot of trouble finding the answer to this question. Let's say I write a dataframe to parquet and use repartition combined with partitionBy to get a nicely partitioned parquet file. See below:

df.repartition(col("DATE")).write.partitionBy("DATE").parquet("/path/to/parquet/file")

Now later on I would like to read the parquet file so I do something like this:

val df = spark.read.parquet("/path/to/parquet/file")

Is the dataframe partitioned by "DATE"? In other words, if a parquet file is partitioned, does Spark maintain that partitioning when reading it into a Spark dataframe, or is it randomly partitioned?

An explanation of why or why not would be helpful as well.

You will have the same number of partitions as there are folders named /path/to/parquet/file/DATE=* - philantrovert
@philantrovert I was reading about some concerns that this approach causes work to be done on the driver. For metadata I would imagine that is not an issue - or is it? Also, when using S3, I am assuming the Hive metastore does not necessarily need to be updated for partitioned parquet access. Or would you recommend always running MSCK REPAIR TABLE ... (as they are external tables)? Thanks in advance. - thebluephantom

bsplosion · Accepted Answer · 2018-08-16T12:27:29

The number of partitions you get when reading data stored as parquet follows many of the same rules as reading partitioned text:

1. If SparkContext.defaultMinPartitions >= the partition count in the data, you get SparkContext.defaultMinPartitions partitions.
2. If the partition count in the data >= SparkContext.defaultParallelism, you get SparkContext.defaultParallelism partitions, though in some cases with very small partitions, #3 may apply instead.
3. Finally, if the partition count in the data is somewhere between SparkContext.defaultMinPartitions and SparkContext.defaultParallelism, you will generally see the data's partitions reflected in the dataset partitioning.
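
One way to see which of the cases above applies to your own data is to print the read partition count next to the two settings. This is only a minimal sketch and not part of the original answer. It assumes the settings mentioned above correspond to SparkContext.defaultMinPartitions and SparkContext.defaultParallelism. The object name InspectReadPartitions and the helper inspect are hypothetical names made up for illustration.

```scala
import org.apache.spark.sql.SparkSession

object InspectReadPartitions {
  // Hypothetical helper: returns
  // (partitions after read, defaultMinPartitions, defaultParallelism, distinct DATE values).
  def inspect(spark: SparkSession, path: String): (Int, Int, Int, Long) = {
    val df = spark.read.parquet(path)
    val sc = spark.sparkContext
    (
      df.rdd.getNumPartitions,              // in-memory partitions Spark created for the read
      sc.defaultMinPartitions,
      sc.defaultParallelism,
      df.select("DATE").distinct().count()  // number of DATE=... values found in the data
    )
  }

  def main(args: Array[String]): Unit = {
    // local[*] is an assumption for trying this out; drop .master(...) when submitting to a cluster.
    val spark = SparkSession.builder()
      .appName("inspect-read-partitions")
      .master("local[*]")
      .getOrCreate()

    // Path taken from the question; pass a different one as the first argument if needed.
    val path = args.headOption.getOrElse("/path/to/parquet/file")
    val (readParts, minParts, parallelism, dates) = inspect(spark, path)

    println(s"partitions after read: $readParts")
    println(s"defaultMinPartitions:  $minParts")
    println(s"defaultParallelism:    $parallelism")
    println(s"distinct DATE values:  $dates")

    spark.stop()
  }
}
```

Comparing the first number with the last one shows directly whether the read kept one partition per DATE folder for your dataset.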

Note that a partitioned parquet file rarely has full data locality for a partition. Even when the partition count in the data matches the read partition count, there is a strong likelihood that the dataset will need to be repartitioned in memory if you're trying to achieve partition data locality for performance.

Given your use case above, I'd recommend immediately repartitioning on the "DATE" column if you're planning to leverage partition-local operations on that basis. The above caveats regarding minPartitions and parallelism settings apply here as well.

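import org.apache.spark.sql.functions.col

// repartition returns a new DataFrame, so keep the result rather than discarding it
val df = spark.read.parquet("/path/to/parquet/file").repartition(col("DATE"))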
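
To check whether rows for one "DATE" really end up together, you can count how many in-memory partitions each DATE value is spread across, before and after the repartition. The following is a rough sketch under stated assumptions rather than a definitive method: the sample rows, the temporary output directory, the object name DatePartitionSpread and the helper maxPartitionsPerDate are all hypothetical and only there for illustration.

```scala
import java.nio.file.Files
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, countDistinct, max, spark_partition_id}

object DatePartitionSpread {
  // Largest number of in-memory partitions that any single DATE value is spread across.
  // Assumes df is non-empty and has a DATE column.
  def maxPartitionsPerDate(df: DataFrame): Long =
    df.select(col("DATE"), spark_partition_id().as("pid")) // tag each row with its current partition
      .groupBy("DATE")
      .agg(countDistinct("pid").as("n"))                   // partitions touched per DATE value
      .agg(max("n"))
      .first()
      .getLong(0)

  def main(args: Array[String]): Unit = {
    // local[4] is an assumption for a quick local run.
    val spark = SparkSession.builder()
      .appName("date-partition-spread")
      .master("local[4]")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample data, written the same way as in the question.
    val out = Files.createTempDirectory("parquet-by-date").resolve("data").toString
    Seq(("2018-08-14", 1), ("2018-08-14", 2), ("2018-08-15", 3), ("2018-08-16", 4))
      .toDF("DATE", "VALUE")
      .repartition(col("DATE"))
      .write.partitionBy("DATE").parquet(out)

    val onRead = spark.read.parquet(out)
    val byDate = onRead.repartition(col("DATE"))

    println(s"max partitions per DATE on read:           ${maxPartitionsPerDate(onRead)}")
    println(s"max partitions per DATE after repartition: ${maxPartitionsPerDate(byDate)}")

    spark.stop()
  }
}
```

After repartition(col("DATE")) the figure should be 1, since hash partitioning sends all rows with the same DATE to the same partition. With a sample this small the on-read figure will probably be 1 as well. A difference is more likely to show up on larger data, where a DATE folder can hold several files or files large enough to be split.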
