I have partitioned Parquet data in HDFS, for example: hdfs://cluster/stage/data/datawarehouse/table=metrics_data/country=india/year=2020/month=06/day=30/hour=23/<part-files.parquet>
I would like to understand which of the following is the better way to read a single hour of this data:
df = spark.read.parquet("hdfs://cluster/stage/data/datawarehouse/table=metrics_data/country=india/year=2020/month=06/day=30/").where(col('hour') == "23")
OR
df = spark.read.parquet("hdfs://cluster/stage/data/datawarehouse/table=metrics_data/country=india/year=2020/month=06/day=30/hour=23")
I would like to understand how the two compare in terms of performance, and whether either approach has any other advantages.
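One difference I can already see is which partition columns end up in the DataFrame. The sketch below is plain Python, not Spark itself. It lists the key=value directories that sit below the path handed to the reader, on the assumption that Spark's default partition discovery only treats directories below the given path as columns and that no basePath option is set. The helper name partition_columns_below is hypothetical and exists only for this illustration.

```python
def partition_columns_below(base_path, leaf_dir):
    """Return {column: value} for key=value directories between base_path and leaf_dir.

    Hypothetical helper: it only mimics the directory-name parsing step and
    keeps values as strings; it does not call Spark.
    """
    base = base_path.rstrip("/")
    leaf = leaf_dir.rstrip("/")
    if not leaf.startswith(base):
        raise ValueError("leaf_dir must be inside base_path")
    columns = {}
    # Only the part of the path below base_path is inspected.
    for segment in leaf[len(base):].split("/"):
        if "=" in segment:
            key, value = segment.split("=", 1)
            columns[key] = value
    return columns


DAY_PATH = ("hdfs://cluster/stage/data/datawarehouse/table=metrics_data/"
            "country=india/year=2020/month=06/day=30/")
HOUR_PATH = DAY_PATH + "hour=23"

# Option 1: read the day directory, then filter on hour.
option_1 = partition_columns_below(DAY_PATH, HOUR_PATH)
# Option 2: read the hour directory directly.
option_2 = partition_columns_below(HOUR_PATH, HOUR_PATH)

print(option_1)
print(option_2)
```

If that assumption holds, the first read exposes an hour column to filter on, while the second returns only the columns stored in the Parquet files unless a basePath is supplied.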