3
votes

I was going through this Apache Spark documentation, and it mentions that:

When running Spark on YARN in cluster mode, environment variables need to be set using the spark.yarn.appMasterEnv.[EnvironmentVariableName] property in your conf/spark-defaults.conf file.

I am running my EMR cluster on AWS Data Pipeline. I wanted to know where I have to edit this conf file. Also, if I create my own custom conf file and specify it as part of --configurations (in the spark-submit), will that solve my use case?
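
For reference, my understanding is that such a line in conf/spark-defaults.conf would look something like this (MY_VAR and some_value are just placeholders):

spark.yarn.appMasterEnv.MY_VAR    some_value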


2 Answers

2
votes

One way to do it is the following (the tricky part is that you might need to set up the environment variables for both the driver and the executor):

spark-submit   \
 --driver-memory 2g \
 --executor-memory 4g \
 --conf spark.executor.instances=4 \
 --conf spark.driver.extraJavaOptions="-DENV_KEY=ENV_VALUE" \
 --conf spark.executor.extraJavaOptions="-DENV_KEY=ENV_VALUE" \
 --master yarn \
 --deploy-mode cluster \
 --class com.industry.class.name \
   assembly-jar.jar

I have tested this on EMR in client mode, but it should work in cluster mode as well.
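
One caveat: the -D flags above define JVM system properties rather than OS environment variables, so inside the application they are read with sys.props (or System.getProperty), not sys.env. A minimal Scala sketch, reusing the ENV_KEY name from the command above (the ReadJavaOption object name is just for illustration):

import org.apache.spark.sql.SparkSession

object ReadJavaOption {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("read-java-option").getOrCreate()

    // Driver JVM: picks up -DENV_KEY=ENV_VALUE from spark.driver.extraJavaOptions
    val driverValue = sys.props.get("ENV_KEY") // Some("ENV_VALUE") if the flag was set
    println("driver ENV_KEY = " + driverValue)

    // Executor JVMs: pick up -DENV_KEY=ENV_VALUE from spark.executor.extraJavaOptions
    val executorValues = spark.sparkContext
      .parallelize(1 to 2)
      .map(_ => sys.props.get("ENV_KEY"))
      .collect()
    println("executor ENV_KEY = " + executorValues.mkString(", "))

    spark.stop()
  }
}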

0
votes

For future reference, you can pass the environment variables directly when creating the EMR cluster, using the Configurations parameter as described in the docs here.

Specifically, the spark-defaults file can be modified by passing a configuration JSON as follows:

{
    "Classification": "spark-defaults",
    "Properties": {
        "spark.yarn.appMasterEnv.[EnvironmentVariableName]": "some_value",
        "spark.executorEnv.[EnvironmentVariableName]": "some_other_value"
    }
},

Here, spark.yarn.appMasterEnv.[EnvironmentVariableName] is used to pass a variable to the application master (and thus to the driver in cluster mode on YARN) (here), while spark.executorEnv.[EnvironmentVariableName] is used to pass a variable to the executor processes (here).
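
To verify that the variables actually arrive, here is a minimal Scala sketch (SOME_VAR and OTHER_VAR are placeholders standing in for [EnvironmentVariableName]); variables set this way show up as ordinary OS environment variables, so they are read with sys.env:

import org.apache.spark.sql.SparkSession

object ReadEnvVars {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("read-env-vars").getOrCreate()

    // Set via spark.yarn.appMasterEnv.SOME_VAR: visible to the driver in cluster mode
    println("driver SOME_VAR = " + sys.env.get("SOME_VAR"))

    // Set via spark.executorEnv.OTHER_VAR: visible inside the executor processes
    val executorValues = spark.sparkContext
      .parallelize(1 to 2)
      .map(_ => sys.env.get("OTHER_VAR"))
      .collect()
    println("executor OTHER_VAR = " + executorValues.mkString(", "))

    spark.stop()
  }
}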