I have a parquet file /df saved in HDFS with 120 partitions. Each partition is around 43.5 M on HDFS.
Total size:
hdfs dfs -du -s -h /df
5.1 G 15.3 G /df
hdfs dfs -du -h /df
43.6 M 130.7 M /df/pid=0
43.5 M 130.5 M /df/pid=1
...
43.6 M 130.9 M /df/pid=119
I want to load that file into Spark and keep the same number of partitions. However, Spark automatically loads it into 60 partitions:
df = spark.read.parquet('/df')
df.rdd.getNumPartitions()
60
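For what it is worth, my understanding is that the partition count comes from Spark's file-source split logic rather than from the HDFS layout. The sketch below reproduces that calculation under assumptions: the usual defaults of 128 MB for spark.sql.files.maxPartitionBytes and 4 MB for spark.sql.files.openCostInBytes, and all 120 files at roughly 43.5 M. With those numbers two files tend to get packed per split, which would explain the 60 partitions.
# Rough sketch of the split sizing Spark's file-based sources use (Spark 2.x+);
# exact behavior varies by version, so treat this as an estimate only.
max_partition_bytes = 128 * 1024 * 1024       # spark.sql.files.maxPartitionBytes default
open_cost = 4 * 1024 * 1024                   # spark.sql.files.openCostInBytes default
file_sizes = [int(43.5 * 1024 * 1024)] * 120  # ~43.5 M per HDFS partition file
total_bytes = sum(file_sizes) + len(file_sizes) * open_cost
bytes_per_core = total_bytes / sc.defaultParallelism
max_split_bytes = min(max_partition_bytes, max(open_cost, bytes_per_core))
# Greedily pack files into splits no larger than max_split_bytes,
# roughly the way Spark's FilePartition packing works.
partitions, current = 0, 0
for size in file_sizes:
    if current + size > max_split_bytes and current > 0:
        partitions += 1
        current = 0
    current += size + open_cost
if current > 0:
    partitions += 1
print(partitions)  # comes out around 60 when two files fit per split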
HDFS settings:
'parquet.block.size' is not set:
sc._jsc.hadoopConfiguration().get('parquet.block.size')
returns None.
'dfs.blocksize' is set to 128 M:
float(sc._jsc.hadoopConfiguration().get("dfs.blocksize"))/2**20
returns
128
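For completeness, the settings the DataFrame reader actually consults appear to be the Spark SQL ones rather than the Hadoop ones; checking what they resolve to (128 MB and 4 MB are the usual defaults):
print(spark.conf.get("spark.sql.files.maxPartitionBytes", "134217728"))  # default 128 MB
print(spark.conf.get("spark.sql.files.openCostInBytes", "4194304"))      # default 4 MB
print(sc.defaultParallelism)  # also feeds into the split size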
Lowering either parquet.block.size or dfs.blocksize does not make Spark load the parquet file with the same number of partitions it has in HDFS.
For example:
sc._jsc.hadoopConfiguration().setInt("parquet.block.size", 64*2**20)
sc._jsc.hadoopConfiguration().setInt("dfs.blocksize", 64*2**20)
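To be explicit about what "does not" means: the Hadoop values do change, but the read is unaffected. As far as I can tell, parquet.block.size only matters when writing Parquet (it is the row-group size) and dfs.blocksize only applies to newly written HDFS files, so neither influences how existing files are split on read. A hedged sanity check:
sc._jsc.hadoopConfiguration().setInt("dfs.blocksize", 64 * 2**20)
print(sc._jsc.hadoopConfiguration().get("dfs.blocksize"))  # '67108864', so the setting took
df = spark.read.parquet('/df')
print(df.rdd.getNumPartitions())  # partition count is unchanged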
I realize 43.5 M is well below 128 M. However, for this application, I am going to immediately complete many transformations that will result in each of the 120 partitions getting much closer to 128 M.
I am trying to save myself from having to repartition in the application immediately after loading.
Is there a way to force Spark to load the parquet file with the same number of partitions that it has in HDFS?
spark.conf.set("spark.files.maxPartitionBytes", 43500000)
– thePurplePython
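If the goal is roughly one Spark partition per HDFS file, that suggestion seems close; as far as I can tell the key the DataFrame reader honors is spark.sql.files.maxPartitionBytes (spark.files.maxPartitionBytes applies to SparkContext file reads). A sketch, with the cap chosen just above one file's size; the exact count still depends on spark.sql.files.openCostInBytes and on sc.defaultParallelism:
# Cap each split just above one ~43.5 M file so files are not packed together.
spark.conf.set("spark.sql.files.maxPartitionBytes", 48 * 1024 * 1024)
df = spark.read.parquet('/df')
print(df.rdd.getNumPartitions())  # hoping for ~120, one partition per file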