
Using moto I was able to mock an EMR cluster:

import boto3
import moto

with moto.mock_emr():
    client = boto3.client('emr', region_name='us-east-1')
    client.run_job_flow(
        Name='my_cluster',
        Instances={
            'MasterInstanceType': 'c3.xlarge',
            'SlaveInstanceType': 'c3.xlarge',
            'InstanceCount': 3,
            'Placement': {'AvailabilityZone': 'us-east-1a'},
            'KeepJobFlowAliveWhenNoSteps': True,
        },
        VisibleToAllUsers=True,
    )
    summary = client.list_clusters()
    cluster_id = summary["Clusters"][0]["Id"]
    res = client.add_job_flow_steps(
        JobFlowId=cluster_id,
        Steps=[
            {
                "Name": "foo_step",
                "ActionOnFailure": "CONTINUE",
                "HadoopJarStep": {"Args": [], "Jar": "command-runner.jar"},
            }
        ],
    )

The added step seems to remain in the STARTING state indefinitely. Is it possible to actually submit a Spark job to the mocked cluster and have it run there?
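For reference, this is how the step state can be checked inside the same mock context (describe_step is a real boto3 EMR call); under moto the state never progresses:

    step_id = res["StepIds"][0]
    status = client.describe_step(ClusterId=cluster_id, StepId=step_id)
    print(status["Step"]["Status"]["State"])  # stays 'STARTING'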

I am building a utility that submits jobs to EMR clusters, and I want to test it. I want to run a trivial Spark job through this utility, which is where the question comes from. Note that I'm not interested in a real Spark cluster or in testing the correctness of the submitted Spark job. I am more interested in testing the flow of submitting a job to an EMR cluster and examining the results (which should ideally be persisted to a mocked S3 bucket).

It's just a mock. - Lamanus

1 Answer


It's not possible: mock_emr is just a mock. It intercepts the EMR API calls and returns canned responses, but no cluster is started and no step is ever executed. If you want to exercise the end-to-end flow, you can instead run your Spark job locally against a mocked S3: start moto, point Spark's S3 configuration at the mocked endpoint, and have the job read from and write to the mocked bucket.
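
A minimal sketch of that approach, assuming moto's server mode (ThreadedMotoServer), a local pyspark installation, and the hadoop-aws connector on Spark's classpath; the bucket name, port, and credentials are placeholders:

    import boto3
    from moto.server import ThreadedMotoServer
    from pyspark.sql import SparkSession

    # Run moto as an HTTP server so Spark's S3A connector can reach it.
    server = ThreadedMotoServer(port=5000)
    server.start()
    endpoint = "http://127.0.0.1:5000"

    # Create the mocked bucket through the same endpoint.
    s3 = boto3.client(
        "s3",
        region_name="us-east-1",
        endpoint_url=endpoint,
        aws_access_key_id="testing",
        aws_secret_access_key="testing",
    )
    s3.create_bucket(Bucket="my-bucket")

    # Point a local Spark session's S3A filesystem at the moto endpoint.
    spark = (
        SparkSession.builder.master("local[1]")
        .config("spark.hadoop.fs.s3a.endpoint", endpoint)
        .config("spark.hadoop.fs.s3a.access.key", "testing")
        .config("spark.hadoop.fs.s3a.secret.key", "testing")
        .config("spark.hadoop.fs.s3a.path.style.access", "true")
        .config("spark.hadoop.fs.s3a.connection.ssl.enabled", "false")
        .getOrCreate()
    )

    # A trivial job: write to and read back from the mocked bucket.
    df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "val"])
    df.write.mode("overwrite").parquet("s3a://my-bucket/out")
    assert spark.read.parquet("s3a://my-bucket/out").count() == 2

    spark.stop()
    server.stop()

This doesn't test EMR itself, but it does let you verify what the question actually asks for: that the job your utility submits runs and that its results land in the (mocked) S3 bucket.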