
I have a spark job where I do the following

  1. Load the data from parquet via Spark SQL and convert it to a pandas df. The data size is only 250 MB.
  2. Run an rdd.foreach to iterate over a relatively small dataset (1000 rows), take the pandas df from step 1, and do some transformations (a rough sketch of this setup follows the list).
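
A minimal sketch of that setup (the paths /data/lookup and /data/small_input and the column name "key" below are placeholders):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pandas_lookup_job").getOrCreate()

# step 1: load the parquet data via Spark SQL and pull it to the driver as a pandas df (~250 MB)
lookup_pdf = spark.sql("SELECT * FROM parquet.`/data/lookup`").toPandas()

# step 2: iterate over the small dataset (~1000 rows) and transform each row against the pandas df
def transform(row):
    # the pandas df is captured in the task closure, so each task ships its own copy
    subset = lookup_pdf[lookup_pdf["key"] == row["key"]]
    # ... transformation on subset ...

small_rdd = spark.sql("SELECT * FROM parquet.`/data/small_input`").rdd
small_rdd.foreach(transform)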

I get a Container killed by YARN for exceeding memory limits error after some iterations.

Container killed by YARN for exceeding memory limits. 14.8 GB of 6 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead

I am unable to understand why the error says 14.8 GB of 6 GB physical memory used?

I have tried increasing spark.yarn.executor.memoryOverhead, using the following command:

spark-submit --master yarn --deploy-mode cluster --num-executors 4 --executor-cores 2 --executor-memory 2G --conf spark.yarn.executor.memoryOverhead=4096 --py-files test.zip app_main.py

I am using Spark 2.3.

yarn.scheduler.minimum-allocation-mb = 512 MB
yarn.nodemanager.resource.memory-mb = 126 GB
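
For reference, the 6 GB limit in the error message lines up with the container size these settings request from YARN (executor memory plus memoryOverhead, rounded up to the scheduler increment). A quick back-of-the-envelope check:

executor_memory_mb = 2 * 1024   # --executor-memory 2G
memory_overhead_mb = 4096       # spark.yarn.executor.memoryOverhead=4096
min_allocation_mb = 512         # yarn.scheduler.minimum-allocation-mb

requested_mb = executor_memory_mb + memory_overhead_mb                      # 6144 MB
container_mb = -(-requested_mb // min_allocation_mb) * min_allocation_mb    # rounded up to a 512 MB multiple
print(container_mb / 1024, "GB")                                            # 6.0 GB, the limit reported in the error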

1 Answer


This is one of the common errors when the memoryOverhead option is used; it is better to tune the job with other options.

The post at http://ashkrit.blogspot.com/2018/09/anatomy-of-apache-spark-job.html talks about this issue and how to deal with it.
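
As one hedged illustration of the kind of job-level tuning meant here (a common suggestion for the pattern in the question, not necessarily the specific advice in the linked post): broadcast the lookup data once per executor instead of capturing the pandas df in every task closure, so the Python workers do not keep paying for repeated copies. Using the names from the sketch in the question:

# sketch only: broadcast the pandas df from step 1 once,
# rather than letting every task ship its own copy via closure capture
lookup_bc = spark.sparkContext.broadcast(lookup_pdf)

def transform(row):
    pdf = lookup_bc.value                     # deserialized once per executor and reused
    subset = pdf[pdf["key"] == row["key"]]    # "key" is the placeholder column name
    # ... transformation on subset ...

small_rdd.foreach(transform)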