I have many Parquet directories on HDFS, each containing a few thousand small (mostly < 100 KB) Parquet files. They slow down my Spark jobs, so I want to combine them.
With the following code I can repartition a local Parquet file into a smaller number of parts:
val pqFile = sqlContext.read.parquet("file:/home/hadoop/data/file.parquet")      // read the directory of small files
pqFile.coalesce(4).write.save("file:/home/hadoop/data/fileSmaller.parquet")      // rewrite as 4 larger part files
But I don't know how to programmatically get the size of a directory on HDFS from Scala, so I can't work out how many partitions to pass to coalesce for the real data set.
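I imagine something along these lines might work, using the Hadoop FileSystem API to measure the directory and dividing by a target partition size, but I'm not sure this is the right approach (the 128 MB target, the helper name, and the paths are just my guesses):

import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.SQLContext

// Rough sketch (my guess): sum the bytes under the directory with the
// Hadoop FileSystem API, then derive a partition count from a target size.
def estimatePartitions(sqlContext: SQLContext, dir: String,
                       targetBytes: Long = 128L * 1024 * 1024): Int = {
  val conf = sqlContext.sparkContext.hadoopConfiguration
  val path = new Path(dir)
  val fs = path.getFileSystem(conf)
  val totalBytes = fs.getContentSummary(path).getLength   // total size of all files under dir
  math.max(1, math.ceil(totalBytes.toDouble / targetBytes).toInt)
}

val dir = "hdfs:///data/someTable.parquet"                 // hypothetical path
val pqFile = sqlContext.read.parquet(dir)
val numParts = estimatePartitions(sqlContext, dir)
pqFile.coalesce(numParts).write.save(dir + "_compacted")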
How can I do this? Or is there a convenient way in Spark to configure the writer to write Parquet partitions of a fixed size?