I have a Spark SQL query that reads several small Parquet files (each roughly 2 MB).
My Spark block size is 256 MB, so I want to combine these small files into one (or possibly more) file/s of about 256 MB each. What I am thinking is to find the DataFrame's size and divide it by 256 MB so that I know how many output files there should be, but unfortunately Spark does not expose a direct way to find a DataFrame's size since it is distributed. I am considering converting the DataFrame to a Dataset or a list so that I can check the size that way.
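For context, this is a rough sketch of what I have in mind. Instead of measuring the DataFrame itself, it estimates the total size from the input files' on-disk footprint via the Hadoop FileSystem API, then coalesces to the computed file count (`inputPath` and `outputPath` are placeholders for my actual paths):

```scala
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("CompactParquet").getOrCreate()

// Placeholder paths -- substitute the real input/output locations.
val inputPath = "/data/small-parquet"
val outputPath = "/data/compacted-parquet"
val targetBytes = 256L * 1024 * 1024 // 256 MB target file size

// Sum the on-disk size of the input Parquet files.
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
val totalBytes = fs.listStatus(new Path(inputPath))
  .filter(_.getPath.getName.endsWith(".parquet")) // skip _SUCCESS etc.
  .map(_.getLen)
  .sum

// Number of output files = total size / 256 MB, rounded up, at least 1.
val numFiles = math.max(1, math.ceil(totalBytes.toDouble / targetBytes).toInt)

// coalesce reduces the partition count without a full shuffle.
spark.read.parquet(inputPath)
  .coalesce(numFiles)
  .write.parquet(outputPath)
```

Is this a reasonable approach, or is there a better way to size the DataFrame / control the output file size?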