I ran into a strange problem while reading a Spark DataFrame back from disk. I repartitioned the DataFrame into 50k partitions before saving it. However, when I read it back and perform a count action on it, the underlying RDD has only 2143 partitions under Spark 2.0.
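For context, the write side looked roughly like this (a minimal sketch; how the DataFrame itself was built is omitted here):

// df is the original DataFrame, repartitioned to 50k partitions before saving
df.repartition(50000).write.parquet("repartitionedData")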
So I went to the path where I saved the repartitioned data and found the following:
hfs -ls /repartitionedData/ | wc -l
50476
So it did create about 50k partition files while saving the data.
However, with Spark 2.0:
val d = spark.read.parquet("repartitionedData")
d.rdd.getNumPartitions
res4: Int = 2143
But with Spark 1.5:
val d = spark.read.parquet("repartitionedData")
d.rdd.partitions.length
res4: Int = 50474
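I wonder whether the file-source settings that Spark 2.0 introduced for packing files into read partitions are involved. I am not sure these are the cause, but they can be inspected like this (the config names are from the Spark 2.0 documentation):

// Upper bound on bytes packed into a single read partition
spark.conf.get("spark.sql.files.maxPartitionBytes")
// Estimated cost (in bytes) of opening a file, used when packing small files together
spark.conf.get("spark.sql.files.openCostInBytes")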
Can someone explain why Spark 2.0 reports so many fewer partitions than Spark 1.5 here, and how I can get back the original partition count when reading?