Scenario:
I am running a Spark Scala job on AWS EMR. The job dumps some metadata unique to that application. I currently write it to the location "s3://bucket/key/<APPLICATION_ID>", where the application ID comes from `val APPLICATION_ID: String = getSparkSession.sparkContext.getConf.getAppId`.
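For context, here is a minimal sketch of my current approach (the `bucket/key` prefix and the metadata contents are placeholders, and `getSparkSession` is my own helper, so the sketch just builds the session directly):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

// The YARN application id, e.g. "application_1684245340000_0001".
val APPLICATION_ID: String = spark.sparkContext.getConf.getAppId

// Placeholder metadata; the real job writes application-specific values.
val metadata = Seq(("jobName", "myJob"), ("status", "ok")).toDF("key", "value")
metadata.write.json(s"s3://bucket/key/$APPLICATION_ID")
```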
Is there a way to write to an S3 location like "s3://bucket/key/<emr_cluster_id>_<emr_step_id>" instead? In other words, how can I get the EMR cluster ID and step ID from inside the Spark Scala application?
Writing in this way would help me debug: from the output path alone I could identify the cluster and step and go straight to their logs.
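Concretely, the write I am after would look something like this (a sketch; `clusterId` and `stepId` are exactly the values I don't know how to obtain, and the example IDs below are made up):

```scala
// Hypothetical helper: if I could obtain the two ids from inside the job,
// the output path would become:
def metadataPath(clusterId: String, stepId: String): String =
  s"s3://bucket/key/${clusterId}_${stepId}"

// e.g. metadataPath("j-1ABCDEFGHIJKL", "s-2ABCDEFGHIJKL")
//   => "s3://bucket/key/j-1ABCDEFGHIJKL_s-2ABCDEFGHIJKL"
```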
Is there any way other than reading "/mnt/var/lib/info/job-flow.json"?
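For reference, this is the workaround I would like to avoid (a sketch; as far as I can tell the file contains a `jobFlowId` field holding the cluster ID, the regex parsing is deliberately quick and dirty, it only works on the driver where the file exists, and it does not give me the step ID at all):

```scala
import scala.io.Source

// Read the per-node metadata file that EMR writes (driver-side only).
val src = Source.fromFile("/mnt/var/lib/info/job-flow.json")
val jobFlowJson = try src.mkString finally src.close()

// Pull out the cluster id (j-XXXXXXXXXXXXX) from the "jobFlowId" field.
val clusterId: Option[String] =
  """"jobFlowId"\s*:\s*"([^"]+)"""".r
    .findFirstMatchIn(jobFlowJson)
    .map(_.group(1))
```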
PS: I am new to Spark, Scala, and EMR. Apologies in advance if this is an obvious question.