I have a large dataset in Parquet format (~1 TB) that is partitioned into a two-level hierarchy: CLASS and DATE.
There are only 7 classes, but DATE is ever increasing, starting from 2020-01-01.
The data is partitioned by CLASS first and then by DATE.
So something like:
CLASS1 --- DATE 1
       --- DATE 2
       --- ...
       --- DATE N
CLASS2 --- DATE 1
       --- DATE 2
       --- ...
       --- DATE N
I load my data by CLASS in a for-loop. If I load the entire Parquet dataset at once, YARN kills the job because it exceeds the executors' memory. I do need all the dates loaded, though, because my modeling performs a percentile calculation across them. This method takes about 23 hours to complete.
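To illustrate why every DATE partition has to be available at once, here is a plain-Python sketch of a nearest-rank percentile (the values and day split below are made up for illustration):

```python
import math

def percentile(values, p):
    # Nearest-rank percentile: it needs the FULL sorted list, so every
    # DATE partition must be loaded before it can be computed exactly.
    s = sorted(values)
    k = max(0, math.ceil(p / 100 * len(s)) - 1)
    return s[k]

# Values spread across three daily partitions: the overall 90th
# percentile cannot be assembled from per-day percentiles alone.
day1, day2, day3 = [1, 5, 9], [2, 8], [3, 7, 10]
print(percentile(day1 + day2 + day3, 90))  # -> 10
```

This is the reason the per-CLASS load cannot simply be split further by DATE.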
However, if I repartition the data so that it has only the CLASS partition level, the job takes about 10 hours.
Does having too many sub-partitions slow down the spark executor jobs?
I keep the partition hierarchy as CLASS -> DATE only because I need to append new data by DATE every day.
If having only one partition level is more efficient, then I would have to repartition down to just the CLASS partition every day after loading the new data.
Could someone explain why having a single partition level works faster? And if so, what would be the best way to append new data daily without repartitioning the entire dataset?
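For reference, the daily append I do today is along these lines (a sketch; `new_df` stands in for the new day's DataFrame, and the root path is assumed). `mode("append")` with `partitionBy` only adds new DATE directories and never rewrites the existing ones:

```python
def daily_append(new_df, root="s3://bucket/file.parquet/"):
    # mode("append") only adds new files; partitionBy keeps the existing
    # CLASS=<c>/DATE=<d> hierarchy, so old partitions are untouched.
    (new_df.write
        .mode("append")
        .partitionBy("CLASS", "DATE")
        .parquet(root))
```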
Thank You
EDIT:
I use a for-loop over the file structure to iterate by CLASS partition, like so:
import s3fs

fs = s3fs.S3FileSystem(anon=False)
inpath = "s3://bucket/file.parquet/"
dirs = fs.ls(inpath)
for path in dirs:
    custom_path = 's3://' + path + '/'
    class_name = path.split('=')[1]   # e.g. "CLASS=1" -> "1"
    df = spark.read.parquet(custom_path)
    outpath = "s3://bucket/Output_" + class_name + ".parquet"
    # Perform calculations
    df.write.mode('overwrite').parquet(outpath)
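As an alternative to listing S3 directories with s3fs, the loop body could read the dataset root once per class and filter on the partition column; Spark's partition pruning then scans only the matching CLASS=&lt;value&gt; directories. A sketch (the root path is assumed):

```python
def load_class(spark, class_value, root="s3://bucket/file.parquet/"):
    # Filtering on the partition column with a SQL predicate lets Spark
    # prune the read down to the CLASS=<class_value> directories only.
    return spark.read.parquet(root).where("CLASS = '{}'".format(class_value))
```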
The loaded df will have all the dates for CLASS=1. I then write a separate parquet file for each CLASS, so I end up with 7 parquet files:
Output_1.parquet
Output_2.parquet
Output_3.parquet
Output_4.parquet
Output_5.parquet
Output_6.parquet
Output_7.parquet
Merging the 7 parquet files into a single parquet file afterwards is not a problem, since the resulting files are much smaller.
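The merge step can stay simple because the per-class outputs are small. A sketch (assumes the 7 outputs share a schema; the paths are the ones above):

```python
def merge_outputs(spark, paths, dest):
    # spark.read.parquet accepts multiple paths; coalesce(1) collapses
    # the small combined result into a single output file.
    spark.read.parquet(*paths).coalesce(1).write.mode("overwrite").parquet(dest)
```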