
I have a directory that contains folders named by date, with the run date as part of the folder name. I have a daily Spark job in which I need to load the last 7 days' files on any given day.

Unfortunately the directory contains other files as well, so I can't rely on partition discovery.

The folders follow this naming format:

prefix-yyyyMMdd/

How can I load the folders from the last 7 days in one shot?

Since the date is the run date, I can't use a predefined regex to load the data, as I have to account for month and year changes.

I have a couple of brute-force solutions:

  1. Load the data into 7 DataFrames and do a unionAll to get one DataFrame from the 7. This looks inefficient, but not entirely bad (see the sketch after this list).

  2. Load the entire folder and apply a where condition on the date column. This looks scan-heavy, as the folder contains years' worth of data.
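For concreteness, here is roughly what I mean by the two options. This is a Scala sketch assuming an active SparkSession `spark`; the `/data` base path, Parquet format, and `event_date` column are made-up placeholders:

    import java.time.LocalDate
    import java.time.format.DateTimeFormatter
    import org.apache.spark.sql.functions.{col, current_date, date_sub}

    val fmt = DateTimeFormatter.ofPattern("yyyyMMdd")
    val runDate = LocalDate.now()

    // Option 1: read each of the 7 daily folders, then union them.
    // (union is the Spark 2.x+ name for unionAll.)
    val daily = (0 until 7).map { i =>
      spark.read.parquet(s"/data/prefix-${runDate.minusDays(i).format(fmt)}")
    }
    val last7FromUnion = daily.reduce(_ union _)

    // Option 2: read every daily folder via a glob, then filter on the
    // date column. This scans years of data just to keep 7 days.
    val last7FromFilter = spark.read
      .parquet("/data/prefix-*")
      .where(col("event_date") >= date_sub(current_date(), 7))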

Neither looks performance-efficient, and since each day's data is itself huge, I would like to know if there are any better solutions.

Is there a better way to do this?


1 Answer


DataFrameReader methods can take multiple paths, e.g.

spark.read.parquet("prefix-20190704", "prefix-20190703", ...)
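So you can compute the 7 folder names at runtime and pass them all to a single read. A minimal Scala sketch, assuming Parquet folders under a hypothetical base path `/data`; `java.time.LocalDate.minusDays` handles month and year rollovers for you:

    import java.time.LocalDate
    import java.time.format.DateTimeFormatter

    val fmt = DateTimeFormatter.ofPattern("yyyyMMdd")
    val runDate = LocalDate.now()

    // One path per day for the trailing 7-day window; minusDays rolls over
    // month and year boundaries automatically.
    val paths = (0 until 7).map(i => s"/data/prefix-${runDate.minusDays(i).format(fmt)}")

    // parquet(paths: String*) is variadic, so all 7 folders load as one DataFrame.
    val df = spark.read.parquet(paths: _*)

One caveat: the read fails if any of the paths does not exist, so if a day's folder can be missing you may want to check the list against the filesystem before reading.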