0
votes

I am running a Spark application with 5 executors with 5 cores per executor. However, I have noticed that only a single executor does most of the work (i.e most of the tasks are done there). The jobs that I am running are highly parallel (20 partitions or greater). How do you explain this behavior?

Even if I decrease the number of cores per executor, the result is just that fewer tasks run concurrently on that single executor. Should I limit the memory per executor so that more executors are used (in case the whole dataset fits on a single executor)?

1
It depends on the transformations and the dataset you are working on. For example, if your data is (1,2),(2,2),(1,3),(1,9),(1,10) and you do a reduceByKey, all your data with key 1 will end up in a single executor. (Knight71)
Simple transformations and actions like df.map(lambda x: x).count() seem to be running on the same executor, so no key is really involved. (ml_0x)
A little bit of a code snippet and some sample data might help. (Knight71)
It turns out that the problem was with the stored data files of Hive. Setting the option mapred.max.split.size resolved the problem. (ml_0x)
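As a sketch of the fix mentioned in the last comment: the Hadoop split size can be passed through to Hive input reading at submit time. The 64 MB value and the file name below are illustrative, not from the original post; a smaller maximum split size makes Hadoop produce more, smaller input splits, and therefore more input partitions to spread across executors.

```shell
# sketch: lower the max input split size (value in bytes) so Hive-backed
# input is split into more partitions; 64 MB is an illustrative choice
spark-submit \
  --conf spark.hadoop.mapred.max.split.size=67108864 \
  my_app.py
```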

1 Answer

0
votes

Just to add my two cents for people facing this issue in the future: this kind of behavior usually arises from skewed partition sizes in the RDD/DataFrame. To debug it, check the partition sizes of the RDD to find out whether any partition is an outlier. If there is one, inspect the elements of that oversized partition to get a sense of what is going on.
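To make that concrete (the RDD name is hypothetical): in PySpark, per-partition sizes can be collected with something like `rdd.glom().map(len).collect()`. The skew itself comes from hash partitioning, which the pure-Python sketch below mimics using the same `hash(key) % numPartitions` idea, with the sample data from the comments above:

```python
# sketch: count elements per partition under hash partitioning,
# mimicking how Spark assigns a key to partition hash(key) % n
def partition_sizes(pairs, num_partitions):
    sizes = [0] * num_partitions
    for key, _value in pairs:
        sizes[hash(key) % num_partitions] += 1
    return sizes

# all four (1, ...) records land in the same partition
data = [(1, 2), (2, 2), (1, 3), (1, 9), (1, 10)]
print(partition_sizes(data, 4))
```

A large gap between the maximum and the other entries in the returned list is exactly the outlier the answer suggests looking for.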

A similar issue is addressed in detail in this Stack Overflow question.