
I was going through this Apache Spark documentation, and it mentions that:

When running Spark on YARN in cluster mode, environment variables need to be set using the spark.yarn.appMasterEnv.[EnvironmentVariableName] property in your conf/spark-defaults.conf file.

I am running my EMR cluster through AWS Data Pipeline. Where do I have to edit this conf file? Also, if I create my own custom conf file and specify it as part of --configurations (in the spark-submit), will that solve my use case?


2 Answers


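One way to do it is the following. The tricky part is that you might need to set the value on both the driver and the executors. Note that the -D options below set JVM system properties (read with System.getProperty), not OS-level environment variables: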

spark-submit   \
 --driver-memory 2g \
 --executor-memory 4g \
 --conf spark.executor.instances=4 \
 --conf spark.driver.extraJavaOptions="-DENV_KEY=ENV_VALUE" \
 --conf spark.executor.extraJavaOptions="-DENV_KEY=ENV_VALUE" \
 --master yarn \
 --deploy-mode cluster \
 --class com.industry.class.name \
 <application-jar> [application-arguments]

I have tested it on EMR in client mode, but it should work in cluster mode as well.
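If the command is launched from a script, the same arguments can be assembled programmatically so the driver and executor options never drift apart. The following is only a minimal sketch: the helper name build_spark_submit and the jar name app.jar are hypothetical, and it assumes the values contain no spaces (multiple -D options are space-separated inside a single property).

```python
import shlex


def build_spark_submit(main_class, app_jar, jvm_props,
                       master="yarn", deploy_mode="cluster"):
    """Return a spark-submit argument list (hypothetical helper).

    jvm_props is a dict of JVM system properties. Each entry becomes a
    -Dkey=value option applied to both the driver and the executors.
    Assumes keys and values contain no whitespace.
    """
    java_opts = " ".join(f"-D{key}={value}" for key, value in jvm_props.items())
    return [
        "spark-submit",
        "--master", master,
        "--deploy-mode", deploy_mode,
        "--conf", f"spark.driver.extraJavaOptions={java_opts}",
        "--conf", f"spark.executor.extraJavaOptions={java_opts}",
        "--class", main_class,
        app_jar,  # placeholder path, not taken from the answer above
    ]


if __name__ == "__main__":
    argv = build_spark_submit(
        "com.industry.class.name", "app.jar", {"ENV_KEY": "ENV_VALUE"}
    )
    # Print a shell-quoted command line for inspection before running it.
    print(shlex.join(argv))
```

Returning a list rather than a single string means it can be handed to subprocess.run without shell quoting issues.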


For future reference, you could pass the environment variable directly when creating the EMR cluster, using the Configurations parameter as described in the docs here.

Specifically, the spark-defaults file can be modified by passing a configuration JSON as follows:

    {
        'Classification': 'spark-defaults',
        'Properties': {
            'spark.yarn.appMasterEnv.[EnvironmentVariableName]': 'some_value',
            'spark.executorEnv.[EnvironmentVariableName]': 'some_other_value'
        }
    }

Here, spark.yarn.appMasterEnv.[EnvironmentVariableName] passes a variable to the YARN Application Master, which hosts the driver in cluster mode (here), and spark.executorEnv.[EnvironmentVariableName] passes a variable to the executor processes (here).
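When several variables are needed, the configuration entry can be generated rather than written by hand. This is a rough sketch under a few assumptions: the function name spark_env_configuration is made up for illustration, and it assumes you want the same value on the Application Master and on the executors.

```python
def spark_env_configuration(env_vars):
    """Build an EMR 'spark-defaults' configuration list (hypothetical helper).

    env_vars maps environment variable names to values. Each variable is
    set for the YARN Application Master (the driver in cluster mode) and
    for the executors.
    """
    properties = {}
    for name, value in env_vars.items():
        properties[f"spark.yarn.appMasterEnv.{name}"] = value
        properties[f"spark.executorEnv.{name}"] = value
    return [{"Classification": "spark-defaults", "Properties": properties}]


if __name__ == "__main__":
    # MY_ENV_VAR is an illustrative name, not one from the question.
    print(spark_env_configuration({"MY_ENV_VAR": "some_value"}))
```

The resulting list has the shape expected by the Configurations parameter at cluster creation (for example with boto3's run_job_flow). A PySpark job should then be able to read the value with os.environ.get("MY_ENV_VAR").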