I am new to the Scala/Spark world and have recently started working on a task that reads some data, processes it, and saves it to S3. I have read several topics/questions on Stack Overflow regarding repartition/coalesce performance and the optimal number of partitions (like this one). Assuming that I have the right number of partitions, my question is: would it be a good idea to repartition an RDD while converting it to a DataFrame? Here is what my code looks like at the moment:

val dataRdd = dataDf.rdd.repartition(partitions)
      // run myFunction on each record asynchronously...
      .map(x => ThreadedConcurrentContext.executeAsync(myFunction(x)))
      // ...awaiting at most asyncThreadsPerTask in-flight futures per partition
      .mapPartitions( it => ThreadedConcurrentContext.awaitSliding(it = it, batchSize = asyncThreadsPerTask, timeout = Duration(3600000, "millis")))

// toDF() on an RDD of tuples needs the SparkSession implicits in scope,
// e.g. import spark.implicits._ (assuming a SparkSession named spark)
val finalDf = dataRdd
      .filter(tpl => tpl._3 != "ERROR")
      .toDF()

Here is what I'm planning to do (repartition the data after the filter):

val finalDf = dataRdd
          .filter(tpl => tpl._3 != "ERROR")
          .repartition(partitions)
          .toDF()

My question is: would it be a good idea to do so? Is there a performance gain here?
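
In case it helps, here is a quick check I could run to see how balanced the partitions are after the filter (a minimal sketch, not part of my actual job; the counting logic is just illustrative):

// count records per partition after the filter to check for skew
// (collect() is fine here because the result is tiny: one pair per partition)
val countsPerPartition = dataRdd
      .filter(tpl => tpl._3 != "ERROR")
      .mapPartitionsWithIndex { (idx, it) => Iterator((idx, it.size)) }
      .collect()

countsPerPartition.foreach { case (idx, n) => println(s"partition $idx: $n records") }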

Note 1: The filter usually removes about 10% of the original data.

Note 2: Here is the first part of the spark-submit command that I use to run the above code:

spark-submit --master yarn --deploy-mode client --num-executors 4 --executor-cores 4 --executor-memory 2G --driver-cores 4 --driver-memory 2G

1 Answer

The answer to your problem depends on the size of your dataRdd, the number of partitions, your executor-cores setting, and the processing power of your HDFS cluster.

With this in mind, you should run some tests on your cluster with different numbers of partitions, including removing the repartition altogether, to fine-tune it and find an accurate answer.
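
For example, a rough way to time such tests end-to-end might look like the sketch below (the candidate partition counts are placeholders, and count() just stands in for your real write to S3):

// crude end-to-end timing for a few candidate partition counts
def timed[T](label: String)(block: => T): T = {
  val start = System.nanoTime()
  val result = block
  println(s"$label took ${(System.nanoTime() - start) / 1e9} s")
  result
}

for (p <- Seq(8, 16, 32)) {
  timed(s"partitions=$p") {
    dataRdd.filter(tpl => tpl._3 != "ERROR").repartition(p).count()
  }
}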

Example: with the spark-submit flags in your question (4 executors x 4 executor-cores = 16 concurrent tasks), you would need partitions=16 to fully utilize all your cores. However, if your dataRdd is only 1 GB, there is little advantage in repartitioning, because repartition triggers a full shuffle, and a shuffle has a performance cost. On top of that, if the processing power of your HDFS cluster is low, or the cluster is under heavy load, you pay additional overhead there as well.
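
Rather than hard-coding that number, you can also ask Spark how many task slots it sees (a sketch; spark here is assumed to be your SparkSession, and the 2x multiplier is just the common rule of thumb of 2-3 tasks per core):

// defaultParallelism on YARN typically equals the total executor cores
// (16 with the spark-submit flags in the question)
val slots = spark.sparkContext.defaultParallelism
val partitions = slots * 2 // rule of thumb: 2-3 tasks per core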

If you do have sufficient resources available on your HDFS cluster and your dataRdd is large (say over 100 GB), then a repartition should help improve performance with the config values in the example above.