I have a dataframe that is created by a daily batch job which runs for a specific day and is then saved to HDFS (Azure Data Lake Gen 2).
It is saved using something like this:
TradesWritePath = "abfss://refined@" + datalakename + ".dfs.core.windows.net/curated/clientscoredata/clientscoredatainput/Trades/" + TradeDatedt + "/Trades_" + tradedateInt + ".parquet"
tradesdf.write.mode("overwrite").parquet(TradesWritePath)
As you can see, I do not partition the dataframe, since it contains only one date.
So, as an example, the file for the first day would be stored in the folder
Trades/2019/08/25
and the file for the next day would be in the folder
Trades/2019/08/26
The question is: once all the data has been written this way, will a filter predicate on dates still be pushed down, and will Spark/HDFS know where to find the data instead of doing a full scan?
Or do I still have to write using the partitionBy option, even though I am only saving one day's data, just so that Spark understands the layout at read time, pushes the predicate down, and HDFS knows where to find the data (instead of doing a full scan)?
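To make that second option concrete, this is roughly what I imagine the partitioned write (and a later read) would look like; the trade_date column and the tradeDateStr variable (e.g. "2019-08-25") are just placeholders I made up for illustration:

from pyspark.sql import functions as F

TradesBasePath = "abfss://refined@" + datalakename + ".dfs.core.windows.net/curated/clientscoredata/clientscoredatainput/Trades"

# add the partition column explicitly and let Spark create trade_date=YYYY-MM-DD subfolders
(tradesdf
    .withColumn("trade_date", F.lit(tradeDateStr))
    .write
    .mode("overwrite")  # I assume I would also need spark.sql.sources.partitionOverwriteMode=dynamic so only this day gets replaced
    .partitionBy("trade_date")
    .parquet(TradesBasePath))

# and a read like this should then, as far as I understand, only scan the matching folder (partition pruning)
df = spark.read.parquet(TradesBasePath).filter(F.col("trade_date") == "2019-08-25")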
The next part of the question is:
When I look at the folder where the parquet file is stored, I see a lot of small part-****.snappy.parquet files. From reading some of the forum questions here, I understand that I have to use repartition if I want to bring the number of files down. But is that actually necessary? I have read that too many small files will of course hurt performance, so one option could be to save the data in chunks of roughly 128 MB (since I am not yet sure how the data will be used downstream, i.e. which columns will be queried), so that the NameNode in HDFS is not overburdened and we also do not end up with files that are too large. Does that guidance apply to these snappy parquet part files as well?
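In case it helps, this is the kind of thing I had in mind for reducing the file count (just a sketch; the single file per day and the number 8 are assumptions on my part, not something I have settled on):

# collapse the output into a single file for the day before writing
# (coalesce avoids a full shuffle; repartition(1) would also work but shuffles all the data)
tradesdf.coalesce(1).write.mode("overwrite").parquet(TradesWritePath)

# or, if one file per day would be too big, aim for a fixed number of files of roughly 128 MB each
tradesdf.repartition(8).write.mode("overwrite").parquet(TradesWritePath)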
There are a lot of concepts here, and I am still trying to get my head around the best possible way to store these dataframes as parquet, so any insight would be hugely appreciated.