1 vote

I have one month of data stored in HDFS: 31 folders, each named by a date in the format yyyy-mm-dd, for example 2020-01-30.

Every 5 minutes we receive data and save it as Parquet files using Spark's append mode. That is 12 files per hour and 288 files per day, so each folder contains about 288 Parquet files, and the whole month of January holds about 8,928 (31*288) Parquet files.
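For reference, the write side looks roughly like this in PySpark (a minimal sketch; the DataFrame name and output path are placeholders, not the exact ones from my job):

# Append the latest 5-minute batch to the folder for its date
incoming_df.write.mode("append").parquet("hdfs:///data/2020-01-30")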

I will be reading the data using Spark.

Will reading this many files cause any performance issues?

Alternatively, what if I maintain one file per day? Say each day contains only one Parquet file, so for the month of January there are 31 Parquet files.

Would that improve performance?


2 Answers

1 vote

Your performance will definitely increase if you can aggregate one day's data into fewer files. Depending on the size of each file and on the number of executors/cores your Spark job has, you'll find the right number of partitions. If you share details about your data, such as its size, number of columns, number of entries per day, and column types (string, date, int, etc.), we can suggest an optimal number of files to aggregate your data into per day or per hour.

I usually partition by day:

../my_parquet_table/year=2020/month=01/day=31/*.parquet

At this level I usually keep every Parquet file below the size of an HDFS block (256 MB in my case).
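A minimal sketch of producing that layout with partitionBy (the DataFrame, its timestamp column event_ts, and the base path are assumptions for illustration):

from pyspark.sql import functions as F

# Derive zero-padded partition columns from an event timestamp and write with
# partitionBy; Spark creates the year=.../month=.../day=... directories itself.
(df
    .withColumn("year",  F.date_format("event_ts", "yyyy"))
    .withColumn("month", F.date_format("event_ts", "MM"))
    .withColumn("day",   F.date_format("event_ts", "dd"))
    .write
    .mode("append")
    .partitionBy("year", "month", "day")
    .parquet("/my_parquet_table"))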

0 votes

As per Spark's architecture, it creates partitions for the data files residing on HDFS, and by default the split is based on the HDFS block size. If you have many small files on HDFS, Spark ends up with roughly one partition per file, and that many partitions can degrade performance because a lot of shuffle operations get involved, and shuffling is a costly operation in Spark.
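As a quick sanity check, you can inspect how many partitions Spark creates when reading the small files (a sketch, reusing the path layout from the other answer):

# Read one day of small Parquet files and count the resulting partitions
data = spark.read.parquet("/my_parquet_table/year=2020/month=01/day=31/")

# With hundreds of small files this number is usually large; exactly how files are
# packed into partitions also depends on spark.sql.files.maxPartitionBytes
print(data.rdd.getNumPartitions())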

In your case, since you are writing a file every five minutes and each one is small, you can combine them into one Parquet file. However, this is something you need to do as a separate Spark job: read all the small Parquet files, write them out as one larger Parquet file, and then process that large file further.
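A minimal sketch of such a compaction job, assuming the day-partitioned layout from the other answer (the compacted output path is just an illustrative choice):

# Read all the small Parquet files for one day...
small = spark.read.parquet("/my_parquet_table/year=2020/month=01/day=31/")

# ...and rewrite them as a single larger Parquet file in a separate location
# (coalesce(1) gives one output file; use a higher number for very large days)
(small
    .coalesce(1)
    .write
    .mode("overwrite")
    .parquet("/my_parquet_compacted/year=2020/month=01/day=31/"))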

Another workaround, if you want to do everything in a single script without the performance hit, is to load the small Parquet files as they are and then use coalesce or repartition to create fewer partitions, which in turn makes processing faster. Choose between repartition and coalesce carefully: coalesce only merges existing partitions and avoids a full shuffle, while repartition performs a full shuffle and can rebalance the data evenly.

Here is a code snippet to do so:

# Read all the small Parquet files for the day
data  = spark.read.parquet("/my_parquet_table/year=2020/month=01/day=31/")

# Reduce to fewer partitions; 5 is just an example, determine this number from the data you receive every day.
# coalesce avoids a full shuffle; use repartition instead if the data needs to be rebalanced evenly.
pdata = data.coalesce(5)

# use pdata for further operations

So in the end you have two options: either create a separate script that combines the small Parquet files into one, or, if you don't want a separate step, repartition or coalesce the data into fewer partitions and then process it.