I ran into a strange problem while reading a Spark DataFrame back from disk. I repartitioned the DataFrame into 50k partitions before saving it. However, when I read it back and perform a count action on it, the underlying RDD has only 2143 partitions under Spark 2.0.
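For context, the write side looked roughly like this (a minimal sketch; how the DataFrame itself was built is omitted here):

// df is the original DataFrame, repartitioned to 50k partitions before saving
df.repartition(50000).write.parquet("repartitionedData")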
So I went to the path where I saved the repartitioned data and found the following:
hfs -ls /repartitionedData/ | wc -l
50476
So it did create about 50k partition files while saving the data.
However, with Spark 2.0:
val d = spark.read.parquet("repartitionedData")
d.rdd.getNumPartitions
res4: Int = 2143
But with Spark 1.5:
val d = spark.read.parquet("repartitionedData")
d.rdd.partitions.length
res4: Int = 50474
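I wonder whether the file-source settings that Spark 2.0 introduced for packing files into read partitions are involved. I am not sure these are the cause, but they can be inspected like this (the config names are from the Spark 2.0 documentation):

// Upper bound on bytes packed into a single read partition
spark.conf.get("spark.sql.files.maxPartitionBytes")
// Estimated cost (in bytes) of opening a file, used when packing small files together
spark.conf.get("spark.sql.files.openCostInBytes")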
Can someone explain why Spark 2.0 reports so many fewer partitions than Spark 1.5 here, and how I can get back the original partition count when reading?