I am new to scala/spark world and have recently started working on a task where it reads some data, processes it and saves it on S3. I have read several topics/questions on stackoverflow regarding repartition/coalesce performance and optimal number of partitions (like this one). Assuming that I have the right number of paritions, my questions is, would it be a good idea to repartition a rdd while converting it to dataframe? Here is how my code looks like at the moment:
val dataRdd = dataDf.rdd.repartition(partitions)
.map(x => ThreadedConcurrentContext.executeAsync(myFunction(x)))
.mapPartitions( it => ThreadedConcurrentContext.awaitSliding(it = it, batchSize = asyncThreadsPerTask, timeout = Duration(3600000, "millis")))
val finalDf = dataRdd
.filter(tpl => tpl._3 != "ERROR")
.toDF()
Here is what I'm planning to do (repartition data after filter):
val finalDf = dataRdd
.filter(tpl => tpl._3 != "ERROR")
.repartition(partitions)
.toDF()
My questions is, would it be a good idea to do so? is there a performance gain here?
Note1: Filter usually removes 10% of original data.
Note2: Here is the first part of spark-submit command that I use to run the above code:
spark-submit --master yarn --deploy-mode client --num-executors 4 --executor-cores 4 --executor-memory 2G --driver-cores 4 --driver-memory 2G