
I have a directory that contains folders named by date, with the run date as part of the folder name. I have a daily Spark job in which I need to load the last 7 days' files on any given day.

Unfortunately the directory contains other files as well, so I can't rely on partition discovery.

The folders follow this naming format:

prefix-yyyyMMdd/

How can I load the folders from the last 7 days in one shot?

Since the date is the run date, I can't use a predefined regex to load the data, as I have to account for month and year changes.

I have a couple of brute-force solutions:

  1. Load the data into 7 DataFrames and do a unionAll to get one DataFrame from the 7. This looks inefficient, but not entirely bad (see the sketch after this list).

  2. Load the entire folder and apply a where condition on the date column. This looks scan-heavy, as the folder contains years' worth of data.
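For concreteness, here is roughly what I mean by the two options. This is a Scala sketch assuming an active SparkSession `spark`; the `/data` base path, Parquet format, and `event_date` column are made-up placeholders:

    import java.time.LocalDate
    import java.time.format.DateTimeFormatter
    import org.apache.spark.sql.functions.{col, current_date, date_sub}

    val fmt = DateTimeFormatter.ofPattern("yyyyMMdd")
    val runDate = LocalDate.now()

    // Option 1: read each of the 7 daily folders, then union them.
    // (union is the Spark 2.x+ name for unionAll.)
    val daily = (0 until 7).map { i =>
      spark.read.parquet(s"/data/prefix-${runDate.minusDays(i).format(fmt)}")
    }
    val last7FromUnion = daily.reduce(_ union _)

    // Option 2: read every daily folder via a glob, then filter on the
    // date column. This scans years of data just to keep 7 days.
    val last7FromFilter = spark.read
      .parquet("/data/prefix-*")
      .where(col("event_date") >= date_sub(current_date(), 7))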

Neither looks performance-efficient, and since each day's data is itself huge, I would like to know if there are any better solutions.

Is there a better way to do this?


1 Answer


DataFrameReader methods can take multiple paths, e.g.

spark.read.parquet("prefix-20190704", "prefix-20190703", ...)
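So you can compute the 7 folder names at runtime and pass them all to a single read. A minimal Scala sketch, assuming Parquet folders under a hypothetical base path `/data`; `java.time.LocalDate.minusDays` handles month and year rollovers for you:

    import java.time.LocalDate
    import java.time.format.DateTimeFormatter

    val fmt = DateTimeFormatter.ofPattern("yyyyMMdd")
    val runDate = LocalDate.now()

    // One path per day for the trailing 7-day window; minusDays rolls over
    // month and year boundaries automatically.
    val paths = (0 until 7).map(i => s"/data/prefix-${runDate.minusDays(i).format(fmt)}")

    // parquet(paths: String*) is variadic, so all 7 folders load as one DataFrame.
    val df = spark.read.parquet(paths: _*)

One caveat: the read fails if any of the paths does not exist, so if a day's folder can be missing you may want to check the list against the filesystem before reading.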