3
votes

I was going through this Apache Spark documentation, and it mentions that:

When running Spark on YARN in cluster mode, environment variables need to be set using the spark.yarn.appMasterEnv.[EnvironmentVariableName] property in your conf/spark-defaults.conf file.

I am running my EMR cluster on AWS Data Pipeline. I wanted to know where I have to edit this conf file. Also, if I create my own custom conf file and specify it as part of --configurations (in the spark-submit), will that solve my use case?
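
For reference, my understanding is that such a line in conf/spark-defaults.conf would look something like this (MY_VAR and some_value are just placeholders):

spark.yarn.appMasterEnv.MY_VAR    some_value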


2 Answers

2
votes

One way to do it is the following (the tricky part is that you might need to set up the environment variables for both the driver and the executor):

spark-submit   \
 --driver-memory 2g \
 --executor-memory 4g \
 --conf spark.executor.instances=4 \
 --conf spark.driver.extraJavaOptions="-DENV_KEY=ENV_VALUE" \
 --conf spark.executor.extraJavaOptions="-DENV_KEY=ENV_VALUE" \
 --master yarn \
 --deploy-mode cluster \
 --class com.industry.class.name \
   assembly-jar.jar

I have tested this on EMR in client mode, but it should work in cluster mode as well.
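
One caveat: the -D flags above define JVM system properties rather than OS environment variables, so inside the application they are read with sys.props (or System.getProperty), not sys.env. A minimal Scala sketch, reusing the ENV_KEY name from the command above (the ReadJavaOption object name is just for illustration):

import org.apache.spark.sql.SparkSession

object ReadJavaOption {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("read-java-option").getOrCreate()

    // Driver JVM: picks up -DENV_KEY=ENV_VALUE from spark.driver.extraJavaOptions
    val driverValue = sys.props.get("ENV_KEY") // Some("ENV_VALUE") if the flag was set
    println("driver ENV_KEY = " + driverValue)

    // Executor JVMs: pick up -DENV_KEY=ENV_VALUE from spark.executor.extraJavaOptions
    val executorValues = spark.sparkContext
      .parallelize(1 to 2)
      .map(_ => sys.props.get("ENV_KEY"))
      .collect()
    println("executor ENV_KEY = " + executorValues.mkString(", "))

    spark.stop()
  }
}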

0
votes

For future reference, you can pass the environment variables directly when creating the EMR cluster, using the Configurations parameter as described in the docs here.

Specifically, the spark-defaults file can be modified by passing a configuration JSON as follows:

{
    "Classification": "spark-defaults",
    "Properties": {
        "spark.yarn.appMasterEnv.[EnvironmentVariableName]": "some_value",
        "spark.executorEnv.[EnvironmentVariableName]": "some_other_value"
    }
},

Here, spark.yarn.appMasterEnv.[EnvironmentVariableName] is used to pass a variable to the application master (and thus to the driver in cluster mode on YARN) (here), while spark.executorEnv.[EnvironmentVariableName] is used to pass a variable to the executor processes (here).
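
To verify that the variables actually arrive, here is a minimal Scala sketch (SOME_VAR and OTHER_VAR are placeholders standing in for [EnvironmentVariableName]); variables set this way show up as ordinary OS environment variables, so they are read with sys.env:

import org.apache.spark.sql.SparkSession

object ReadEnvVars {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("read-env-vars").getOrCreate()

    // Set via spark.yarn.appMasterEnv.SOME_VAR: visible to the driver in cluster mode
    println("driver SOME_VAR = " + sys.env.get("SOME_VAR"))

    // Set via spark.executorEnv.OTHER_VAR: visible inside the executor processes
    val executorValues = spark.sparkContext
      .parallelize(1 to 2)
      .map(_ => sys.env.get("OTHER_VAR"))
      .collect()
    println("executor OTHER_VAR = " + executorValues.mkString(", "))

    spark.stop()
  }
}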