I have a Spark job where I do the following:
- Load the data from Parquet via Spark SQL and convert it to a pandas DataFrame. The data size is only 250 MB.
- Run an rdd.foreach to iterate over a relatively small dataset (1000 rows), and for each row use the pandas DataFrame from step 1 to do some transformations (a rough sketch of the code is below).
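Roughly, the code looks like this (simplified; the table and column names are placeholders, not the real ones):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("app_main").getOrCreate()

# Step 1: load ~250 MB from Parquet via Spark SQL and pull it onto the driver
# as a pandas DataFrame (placeholder table name).
lookup_pdf = spark.sql("SELECT * FROM parquet_table").toPandas()

def transform(row):
    # Step 2: per-row transformation that uses the pandas DataFrame from step 1.
    # The real logic is more involved; "key" is a placeholder column name.
    subset = lookup_pdf[lookup_pdf["key"] == row["key"]]
    # ... do some work with `subset` ...

# The small dataset (~1000 rows) that I iterate over (placeholder table name).
small_df = spark.sql("SELECT * FROM small_table")
small_df.rdd.foreach(transform)
```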
After some iterations I get a "Container killed by YARN for exceeding memory limits" error:
```
Container killed by YARN for exceeding memory limits. 14.8 GB of 6 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead
```
I am unable to understand why the error says 14.8 GB of 6 GB physical memory used. I have already tried increasing spark.yarn.executor.memoryOverhead; this is the spark-submit command I use:
```
spark-submit --master yarn --deploy-mode cluster --num-executors 4 --executor-cores 2 --executor-memory 2G --conf spark.yarn.executor.memoryOverhead=4096 --py-files test.zip app_main.py
```
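I assume the overhead setting is actually being applied; as a sanity check, something like this inside the application should print the values the job is running with:

```python
# Print the effective executor memory settings as seen by the running application.
conf = spark.sparkContext.getConf()
print(conf.get("spark.executor.memory", "not set"))
print(conf.get("spark.yarn.executor.memoryOverhead", "not set"))
```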
I am using Spark 2.3, and these are the relevant YARN settings on the cluster:
- yarn.scheduler.minimum-allocation-mb = 512 MB
- yarn.nodemanager.resource.memory-mb = 126 GB
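If I understand the container sizing correctly, the 6 GB limit in the error is just my executor memory plus the overhead:

```
executor memory  = 2048 MB   (--executor-memory 2G)
memory overhead  = 4096 MB   (spark.yarn.executor.memoryOverhead=4096)
container limit  = 2048 + 4096 = 6144 MB = 6 GB
```

What I cannot explain is how the process ends up using 14.8 GB, more than double that limit, when the pandas DataFrame itself is only about 250 MB.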