I am running a PySpark application on a 32-core, 64 GB server via spark-submit.
Steps in the application (a rough sketch of the pipeline follows this list):
1. df1 = load a ~500-million-row dataset from a CSV file (field1, field2, field3, field4).
2. df2 = load ~500 million documents from MongoDB via the Spark Mongo connector (field1, field2, field3).
3. Left join the two DataFrames (this is the step that throws java.lang.OutOfMemoryError: Java heap space):
   df_output = df1.join(df2, ["field1", "field2", "field3"], "left_outer").select("*")
4. Update the Mongo collection with df_output in append mode.
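For reference, here is a minimal sketch of that pipeline; the file path, Mongo URI, and collection names are placeholders, not the ones actually used:

    from pyspark.sql import SparkSession

    # Placeholder URIs/paths; the real job reads its own CSV path and Mongo collections.
    spark = (SparkSession.builder
             .appName("csv_mongo_left_join")
             .config("spark.mongodb.input.uri", "mongodb://host:27017/db.input_coll")
             .config("spark.mongodb.output.uri", "mongodb://host:27017/db.output_coll")
             .getOrCreate())

    # df1: ~500 million rows from CSV (field1, field2, field3, field4)
    df1 = spark.read.csv("/path/to/input.csv", header=True, inferSchema=True)

    # df2: ~500 million documents from MongoDB via the mongo-spark connector
    df2 = spark.read.format("com.mongodb.spark.sql.DefaultSource").load()

    # Left outer join on the three shared fields -- the step that runs out of heap
    df_output = df1.join(df2, ["field1", "field2", "field3"], "left_outer")

    # Write the result back to MongoDB in append mode
    (df_output.write
     .format("com.mongodb.spark.sql.DefaultSource")
     .mode("append")
     .save())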
Configuration in conf/spark-env.sh:
- SPARK_EXECUTOR_INSTANCES=10
- SPARK_EXECUTOR_CORES=3
- SPARK_EXECUTOR_MEMORY=5GB
- SPARK_WORKER_CORES=30
- SPARK_WORKER_MEMORY=50GB
All other parameters are left at their default values (equivalent per-application settings are sketched below).
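For context, the same resource settings can also be applied per application rather than globally in spark-env.sh. This is only a sketch mirroring the values above, not a tuning recommendation; in standalone mode the executor count is derived from spark.cores.max and spark.executor.cores rather than set directly:

    from pyspark.sql import SparkSession

    # Per-application equivalents of the spark-env.sh values listed above.
    spark = (SparkSession.builder
             .appName("join_tuning_example")
             .config("spark.executor.memory", "5g")
             .config("spark.executor.cores", "3")
             .config("spark.cores.max", "30")
             .getOrCreate())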
I start the master and one worker with:
sbin/start-master.sh
sbin/start-slave.sh master_ip
I run the script with:
nohup bin/spark-submit --packages org.mongodb.spark:mongo-spark-connector_2.11:2.0.0 --master master_ip ../test_scripts/test1.py > /logs/logs.out &
What is the best approach to tune the configuration parameters above for optimal performance on this dataset, and more generally, how should these parameters be chosen for an arbitrary dataset?