2 votes

I have partitioned Parquet data in HDFS, for example:

hdfs://cluster/stage/data/datawarehouse/table=metrics_data/country=india/year=2020/month=06/day=30/hour=23/<part-files.parquet>

I would like to understand which of these is the better way to read the data:

from pyspark.sql.functions import col

df = spark.read.parquet("hdfs://cluster/stage/data/datawarehouse/table=metrics_data/country=india/year=2020/month=06/day=30/").where(col('hour') == "23")

OR

df = spark.read.parquet("hdfs://cluster/stage/data/datawarehouse/table=metrics_data/country=india/year=2020/month=06/day=30/hour=23")

I would like to understand the difference in terms of performance, and whether either approach has any other advantages.

2 Answers

1 vote

This is pretty straightforward. The first thing to do when reading the files is to filter out the unnecessary data with df = df.filter(...); Spark applies that filter before the data is loaded into memory. Advanced file formats like Parquet and ORC support predicate pushdown, which skips data that cannot match the filter. Since hour is a partition column here, the filter is applied as partition pruning, so only the hour=23 directory is scanned. Either way, this is far faster than loading the full data and filtering afterwards.
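
A minimal sketch of how to verify this, assuming an existing SparkSession named spark (the path is the one from the question):

from pyspark.sql.functions import col

# Read the day directory and filter on the partition column.
df = spark.read.parquet(
    "hdfs://cluster/stage/data/datawarehouse/table=metrics_data/country=india/year=2020/month=06/day=30/"
).where(col("hour") == "23")

# In the physical plan the condition shows up under PartitionFilters,
# which means the other hour=* directories are never scanned.
df.explain()

If the condition appeared only as a post-scan Filter instead, Spark would still be reading every hour=* directory.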

1 vote

If you have a big hierarchy of directories/files, reading the single directory directly could be faster than filtering, because Spark first has to list the files and build an index of the partitions before it can apply that filter. See the following question & answer.
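
For illustration, the direct read could look like this. One caveat: when you point the reader below the hour=* level, hour is no longer inferred as a column; Spark's basePath option restores it (the variable name base is just for this sketch):

# Reading the hour=23 directory directly: Spark never has to list or
# prune the sibling hour=* directories.
base = "hdfs://cluster/stage/data/datawarehouse/table=metrics_data/country=india/year=2020/month=06/day=30"

df = spark.read \
    .option("basePath", base) \
    .parquet(base + "/hour=23")  # basePath keeps hour as a partition column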