1 vote

I have a text file in HDFS with about 10 million records. I am trying to read the file and do some transformations on that data, and I want to partition the data uniformly before processing it. Here is the sample code:

var myRDD = sc.textFile("input file location")

myRDD = myRDD.repartition(10000)

When I do my transformations on this repartitioned data, I see that one partition has an abnormally large number of records while the others have very little data. (image of the distribution)

So the load is high on only one executor. I also tried the following and got the same result:

myRDD.coalesce(10000, shuffle = true)

Is there a way to uniformly distribute records among partitions?

Attached is the shuffle read size / number of records on that particular executor; the circled one has many more records to process than the others.

Any help is appreciated, thank you.

@Sudharnath - Could you please provide the sample input file which you are trying to load into the RDD? - vikrant rana
If you can provide the sample file, I can suggest a way to uniformly distribute the data. You cannot distribute data uniformly in an RDD the way you can with a DataFrame. Please provide your sample data so we can look further. - vikrant rana
@vikrant rana, thanks for the reply. I solved this issue by using DataFrames. - Sudharnath

1 Answer

0 votes

To deal with the skew, you can repartition your data using DISTRIBUTE BY (or using repartition, as you did). For the expression to partition by, choose something that you know will distribute the data evenly.

You can even use the primary key of the DataFrame (RDD).
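
As an illustration (not from the original answer), a rough DataFrame sketch of that idea might look like the following; the input path, the 200-partition count, and the surrogate "id" column are all placeholders:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, monotonically_increasing_id}

// placeholder session and path; adjust to your environment
val spark = SparkSession.builder().appName("even-repartition").getOrCreate()
val df = spark.read.text("input file location")

// give every row a unique surrogate key, then repartition on that key;
// hashing a unique key spreads rows far more evenly than hashing a skewed column
val evenDf = df
  .withColumn("id", monotonically_increasing_id())
  .repartition(200, col("id"))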

Even this approach will not guarantee that the data is distributed evenly between partitions; it all depends on the hash of the expression by which we distribute. See also: Spark : how can evenly distribute my records in all partition
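
If you need a near-perfectly even split on the RDD itself, one sketch (my own assumption, reusing the question's myRDD and its 10000-partition target) is to key every record by a sequential index and partition on that index:

import org.apache.spark.HashPartitioner

val numParts = 10000
val evenRDD = myRDD
  .zipWithIndex()                               // (record, 0L), (record, 1L), ...
  .map { case (rec, idx) => (idx, rec) }        // make the index the key
  .partitionBy(new HashPartitioner(numParts))   // sequential keys hash almost evenly
  .values                                       // drop the index again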

Salting can also be used; it involves adding a new "fake" key and using it alongside the current key for better distribution of the data. (here is a link for salting)
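
A minimal salting sketch, assuming a DataFrame df with a skewed column named "key" (both names are placeholders) and 32 salt buckets chosen arbitrarily:

import org.apache.spark.sql.functions.{col, concat_ws, rand}

val saltBuckets = 32
val salted = df
  .withColumn("salt", (rand() * saltBuckets).cast("int"))                         // random bucket 0..31 per row
  .withColumn("salted_key", concat_ws("_", col("key"), col("salt").cast("string")))

// repartitioning on the salted key splits each hot key across many partitions
val distributed = salted.repartition(200, col("salted_key"))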