I have a text file in HDFS with about 10 million records. I am reading the file and doing some transformations on the data, and I want to partition the data uniformly before processing it. Here is the sample code:
var myRDD = sc.textFile("input file location")  // ~10 million records
myRDD = myRDD.repartition(10000)                // force a full shuffle into 10000 partitions
When I run my transformations on this repartitioned data, I see that one partition has an abnormally large number of records while the others have very little data (image of the distribution attached).
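For anyone who wants to reproduce the numbers, the per-partition counts can be checked from the shell with something like this (a minimal diagnostic sketch; it just counts each partition's iterator with mapPartitionsWithIndex):

val counts = myRDD
  .mapPartitionsWithIndex { (idx, it) => Iterator((idx, it.size)) }  // (partition index, record count)
  .collect()
counts.sortBy(-_._2).take(10).foreach { case (idx, n) =>
  println(s"partition $idx: $n records")  // largest partitions first
}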
So the load is high on only one executor. I also tried coalesce with a shuffle and got the same result:

myRDD = myRDD.coalesce(10000, shuffle = true)
Is there a way to uniformly distribute records among partitions?
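One workaround I am considering (a sketch, not tested at scale) is to salt each record with a uniformly random key and hash-partition on that key; the key range and partition count of 10000 here are just placeholders matching my repartition call:

import org.apache.spark.HashPartitioner
import scala.util.Random

val balanced = myRDD
  .map(rec => (Random.nextInt(10000), rec))  // attach a random key in [0, 10000)
  .partitionBy(new HashPartitioner(10000))   // shuffle records by the random key
  .values                                    // drop the key, keep the record

Would this actually distribute records more evenly than repartition, or does repartition already do the same thing internally?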
Attached is the shuffle read size / number of records per executor; the circled one has far more records to process than the others.
Any help is appreciated. Thank you.