2 votes

I have partitioned Parquet data in HDFS, for example:

hdfs://cluster/stage/data/datawarehouse/table=metrics_data/country=india/year=2020/month=06/day=30/hour=23/<part-files.parquet>

I would like to understand which of these is the better way to read the data:

from pyspark.sql.functions import col

df = spark.read.parquet("hdfs://cluster/stage/data/datawarehouse/table=metrics_data/country=india/year=2020/month=06/day=30/").where(col('hour') == "23")

OR

df = spark.read.parquet("hdfs://cluster/stage/data/datawarehouse/table=metrics_data/country=india/year=2020/month=06/day=30/hour=23")

I would like to understand the difference in terms of performance, and whether either approach has any other advantages.

2 Answers

1 vote

This is pretty straightforward. The first thing to do when reading the files is to filter out the unnecessary data with df = df.filter(...); Spark applies that filter before the data is loaded into memory. Advanced file formats like Parquet and ORC support predicate pushdown, which skips data that cannot match the filter. Since hour is a partition column here, the filter is applied as partition pruning, so only the hour=23 directory is scanned. Either way, this is far faster than loading the full data and filtering afterwards.
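
A minimal sketch of how to verify this, assuming an existing SparkSession named spark (the path is the one from the question):

from pyspark.sql.functions import col

# Read the day directory and filter on the partition column.
df = spark.read.parquet(
    "hdfs://cluster/stage/data/datawarehouse/table=metrics_data/country=india/year=2020/month=06/day=30/"
).where(col("hour") == "23")

# In the physical plan the condition shows up under PartitionFilters,
# which means the other hour=* directories are never scanned.
df.explain()

If the condition appeared only as a post-scan Filter instead, Spark would still be reading every hour=* directory.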

1 vote

If you have a big hierarchy of directories/files, reading the single directory directly could be faster than filtering, because Spark first has to list the files and build an index of the partitions before it can apply that filter. See the following question & answer.
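
For illustration, the direct read could look like this. One caveat: when you point the reader below the hour=* level, hour is no longer inferred as a column; Spark's basePath option restores it (the variable name base is just for this sketch):

# Reading the hour=23 directory directly: Spark never has to list or
# prune the sibling hour=* directories.
base = "hdfs://cluster/stage/data/datawarehouse/table=metrics_data/country=india/year=2020/month=06/day=30"

df = spark.read \
    .option("basePath", base) \
    .parquet(base + "/hour=23")  # basePath keeps hour as a partition column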