I have a Spark SQL query that reads several small Parquet files (each roughly 2 MB).
My Spark block size is 256 MB, so I want to combine these small files into one (or possibly more) file/s of about 256 MB each. What I am thinking is to find the DataFrame's size and divide it by 256 MB so that I know how many output files there should be, but unfortunately Spark does not expose a direct way to find a DataFrame's size since it is distributed. I am considering converting the DataFrame to a Dataset or a list so that I can check the size that way.
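For context, this is a rough sketch of what I have in mind. Instead of measuring the DataFrame itself, it estimates the total size from the input files' on-disk footprint via the Hadoop FileSystem API, then coalesces to the computed file count (`inputPath` and `outputPath` are placeholders for my actual paths):

```scala
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("CompactParquet").getOrCreate()

// Placeholder paths -- substitute the real input/output locations.
val inputPath = "/data/small-parquet"
val outputPath = "/data/compacted-parquet"
val targetBytes = 256L * 1024 * 1024 // 256 MB target file size

// Sum the on-disk size of the input Parquet files.
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
val totalBytes = fs.listStatus(new Path(inputPath))
  .filter(_.getPath.getName.endsWith(".parquet")) // skip _SUCCESS etc.
  .map(_.getLen)
  .sum

// Number of output files = total size / 256 MB, rounded up, at least 1.
val numFiles = math.max(1, math.ceil(totalBytes.toDouble / targetBytes).toInt)

// coalesce reduces the partition count without a full shuffle.
spark.read.parquet(inputPath)
  .coalesce(numFiles)
  .write.parquet(outputPath)
```

Is this a reasonable approach, or is there a better way to size the DataFrame / control the output file size?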