I am running a PySpark application on a 32-core, 64 GB server via spark-submit.
Steps in the application (a rough sketch of the pipeline follows this list):
1. df1 = load a ~500-million-row dataset from a CSV file (field1, field2, field3, field4).
2. df2 = load ~500 million documents from MongoDB via the Spark Mongo connector (field1, field2, field3).
3. Left join the two DataFrames (this is the step that throws java.lang.OutOfMemoryError: Java heap space):
   df_output = df1.join(df2, ["field1", "field2", "field3"], "left_outer").select("*")
4. Update the Mongo collection with df_output in append mode.
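For reference, here is a minimal sketch of that pipeline; the file path, Mongo URI, and collection names are placeholders, not the ones actually used:

    from pyspark.sql import SparkSession

    # Placeholder URIs/paths; the real job reads its own CSV path and Mongo collections.
    spark = (SparkSession.builder
             .appName("csv_mongo_left_join")
             .config("spark.mongodb.input.uri", "mongodb://host:27017/db.input_coll")
             .config("spark.mongodb.output.uri", "mongodb://host:27017/db.output_coll")
             .getOrCreate())

    # df1: ~500 million rows from CSV (field1, field2, field3, field4)
    df1 = spark.read.csv("/path/to/input.csv", header=True, inferSchema=True)

    # df2: ~500 million documents from MongoDB via the mongo-spark connector
    df2 = spark.read.format("com.mongodb.spark.sql.DefaultSource").load()

    # Left outer join on the three shared fields -- the step that runs out of heap
    df_output = df1.join(df2, ["field1", "field2", "field3"], "left_outer")

    # Write the result back to MongoDB in append mode
    (df_output.write
     .format("com.mongodb.spark.sql.DefaultSource")
     .mode("append")
     .save())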
Configuration in conf/spark-env.sh:
- SPARK_EXECUTOR_INSTANCES=10
- SPARK_EXECUTOR_CORES=3
- SPARK_EXECUTOR_MEMORY=5GB
- SPARK_WORKER_CORES=30
- SPARK_WORKER_MEMORY=50GB
All other parameters are left at their default values (equivalent per-application settings are sketched below).
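For context, the same resource settings can also be applied per application rather than globally in spark-env.sh. This is only a sketch mirroring the values above, not a tuning recommendation; in standalone mode the executor count is derived from spark.cores.max and spark.executor.cores rather than set directly:

    from pyspark.sql import SparkSession

    # Per-application equivalents of the spark-env.sh values listed above.
    spark = (SparkSession.builder
             .appName("join_tuning_example")
             .config("spark.executor.memory", "5g")
             .config("spark.executor.cores", "3")
             .config("spark.cores.max", "30")
             .getOrCreate())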
I start the master and one worker with:
sbin/start-master.sh
sbin/start-slave.sh master_ip
I run the script with:
nohup bin/spark-submit --packages org.mongodb.spark:mongo-spark-connector_2.11:2.0.0 --master master_ip ../test_scripts/test1.py > /logs/logs.out &
What is the best approach to tune the configuration parameters above for optimal performance on this dataset, and more generally, how should these parameters be chosen for an arbitrary dataset?