Creating AWS EMR cluster with spark step using lambda function fails with "Local file does not exist"

Question

I'm trying to spin up an EMR cluster with a Spark step using a Lambda function.

Here is my lambda function (python 2.7):

import boto3

def lambda_handler(event, context):
    conn = boto3.client("emr")        
    cluster_id = conn.run_job_flow(
        Name='LSR Batch Testrun',
        ServiceRole='EMR_DefaultRole',
        JobFlowRole='EMR_EC2_DefaultRole',
        VisibleToAllUsers=True,
        LogUri='s3n://aws-logs-171256445476-ap-southeast-2/elasticmapreduce/',
        ReleaseLabel='emr-5.16.0',
        Instances={
            "Ec2SubnetId": "<my-subnet>",
            'InstanceGroups': [
                {
                    'Name': 'Master nodes',
                    'Market': 'ON_DEMAND',
                    'InstanceRole': 'MASTER',
                    'InstanceType': 'm3.xlarge',
                    'InstanceCount': 1,
                },
                {
                    'Name': 'Slave nodes',
                    'Market': 'ON_DEMAND',
                    'InstanceRole': 'CORE',
                    'InstanceType': 'm3.xlarge',
                    'InstanceCount': 2,
                }
            ],
            'KeepJobFlowAliveWhenNoSteps': False,
            'TerminationProtected': False
        },
        Applications=[{
            'Name': 'Spark',
            'Name': 'Hive'
        }],
        Configurations=[
          {
            "Classification": "hive-site",
            "Properties": {
              "hive.metastore.client.factory.class": "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
            }
          },
          {
            "Classification": "spark-hive-site",
            "Properties": {
              "hive.metastore.client.factory.class": "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
            }
          }
        ],
        Steps=[{
            'Name': 'mystep',
            'ActionOnFailure': 'TERMINATE_CLUSTER',
            'HadoopJarStep': {
                'Jar': 's3://elasticmapreduce/libs/script-runner/script-runner.jar',
                'Args': [
                    "/home/hadoop/spark/bin/spark-submit", "--deploy-mode", "cluster",
                    "--master", "yarn-cluster", "--class", "org.apache.spark.examples.SparkPi", 
                    "s3://support.elasticmapreduce/spark/1.2.0/spark-examples-1.2.0-hadoop2.4.0.jar", "10"
                ]
            }
        }],
    )
    return "Started cluster {}".format(cluster_id)

The cluster is starting up, but when trying to execute the step it fails. The error log is containing the following exception:

Exception in thread "main" java.lang.RuntimeException: Local file does not exist.
    at com.amazon.elasticmapreduce.scriptrunner.ScriptRunner.fetchFile(ScriptRunner.java:30)
    at com.amazon.elasticmapreduce.scriptrunner.ScriptRunner.main(ScriptRunner.java:56)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.hadoop.util.RunJar.run(RunJar.java:234)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:148)

So it seems like the script-runner is not understanding to pick up the .jar file from S3?

Any help appreciated...

Dominic Nguyen Dominic Nguyen · Accepted Answer · 2018-07-27T03:11:48

Not all EMR pre-built with ability to copy your jar, script from S3 so you must do that in bootstrap steps:

BootstrapActions=[
    {
        'Name': 'Install additional components',
        'ScriptBootstrapAction': {
            'Path': code_dir + '/scripts' + '/emr_bootstrap.sh'
        }
    }
],

And here is what my bootstrap does

#!/bin/bash
HADOOP="/home/hadoop"
BUCKET="s3://<yourbucket>/<path>"

# Sync jars libraries
aws s3 sync ${BUCKET}/jars/ ${HADOOP}/
aws s3 sync ${BUCKET}/scripts/ ${HADOOP}/

# Install python packages
sudo pip install --upgrade pip
sudo ln -s /usr/local/bin/pip /usr/bin/pip
sudo pip install psycopg2 numpy boto3 pythonds

Then you can call your script and jar like this

 {
        'Name': 'START YOUR STEP',
        'ActionOnFailure': 'TERMINATE_CLUSTER',
        'HadoopJarStep': {
            'Jar': 'command-runner.jar',
            'Args': [
                "spark-submit", "--jars", ADDITIONAL_JARS,
                "--py-files", "/home/hadoop/modules.zip",
                "/home/hadoop/<your code>.py"
            ]
        }
    },

Creating AWS EMR cluster with spark step using lambda function fails with "Local file does not exist"

2 Answers