I have a dataframe that is created by a daily batch job which runs for a specific day and is then saved to HDFS (Azure Data Lake Gen 2).
It is saved using something like this:
TradesWritePath = "abfss://refined@" + datalakename + ".dfs.core.windows.net/curated/clientscoredata/clientscoredatainput/Trades/" + TradeDatedt + "/Trades_" + tradedateInt + ".parquet"
tradesdf.write.mode("overwrite").parquet(TradesWritePath)
As you can see, I do not partition the dataframe, since it contains only one date.
So, as an example, the file for the first day would be stored in the folder
Trades/2019/08/25
and the file for the next day would be in the folder
Trades/2019/08/26
The question is: once all the data has been written this way, will a filter predicate on dates still be pushed down, and will Spark/HDFS know where to find the data instead of doing a full scan?
Or do I still have to write using the partitionBy option, even though I am only saving one day's data, just so that Spark understands the layout at read time, pushes the predicate down, and HDFS knows where to find the data (instead of doing a full scan)?
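To make that second option concrete, this is roughly what I imagine the partitioned write (and a later read) would look like; the trade_date column and the tradeDateStr variable (e.g. "2019-08-25") are just placeholders I made up for illustration:

from pyspark.sql import functions as F

TradesBasePath = "abfss://refined@" + datalakename + ".dfs.core.windows.net/curated/clientscoredata/clientscoredatainput/Trades"

# add the partition column explicitly and let Spark create trade_date=YYYY-MM-DD subfolders
(tradesdf
    .withColumn("trade_date", F.lit(tradeDateStr))
    .write
    .mode("overwrite")  # I assume I would also need spark.sql.sources.partitionOverwriteMode=dynamic so only this day gets replaced
    .partitionBy("trade_date")
    .parquet(TradesBasePath))

# and a read like this should then, as far as I understand, only scan the matching folder (partition pruning)
df = spark.read.parquet(TradesBasePath).filter(F.col("trade_date") == "2019-08-25")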
The next part of the question is:
When I look at the folder where the parquet file is stored, I see a lot of small part-****.snappy.parquet files. From reading some of the forum questions here, I understand that I have to use repartition if I want to bring the number of files down. But is that actually necessary? I have read that too many small files will of course hurt performance, so one option could be to save the data in chunks of roughly 128 MB (since I am not yet sure how the data will be used downstream, i.e. which columns will be queried), so that the NameNode in HDFS is not overburdened and we also do not end up with files that are too large. Does that guidance apply to these snappy parquet part files as well?
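In case it helps, this is the kind of thing I had in mind for reducing the file count (just a sketch; the single file per day and the number 8 are assumptions on my part, not something I have settled on):

# collapse the output into a single file for the day before writing
# (coalesce avoids a full shuffle; repartition(1) would also work but shuffles all the data)
tradesdf.coalesce(1).write.mode("overwrite").parquet(TradesWritePath)

# or, if one file per day would be too big, aim for a fixed number of files of roughly 128 MB each
tradesdf.repartition(8).write.mode("overwrite").parquet(TradesWritePath)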
There are a lot of concepts here, and I am still trying to get my head around the best possible way to store these dataframes as parquet, so any insight would be hugely appreciated.