
I wrote Scala code to launch a cluster in EMR. I also have a Spark application written in Scala, and I want to run it on that EMR cluster. Is it possible to do this from the first script (the one that launches the EMR cluster)? I want to do all of this with the SDK, not through the console or CLI. It has to be automated, with little or no manual work.

Basically;

Launch EMR Cluster -> Run Spark Job on EMR -> Terminate after job finished

How do I do it if possible? Thanks.

AWS SDK for what? - Lamanus
@Lamanus for Java - cPhoenix

2 Answers

// Assumes an AmazonElasticMapReduce client `emr` and the id of a running
// cluster `clusterId` are already available in scope.
import com.amazonaws.services.elasticmapreduce.model.*;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Arguments passed to command-runner.jar; this is a plain spark-submit call.
List<String> params = Arrays.asList(
    "spark-submit",
    "--class", "com.example.MySparkApp",   // your application's main class
    "s3://my-bucket/my-spark-app.jar");    // your application jar on S3

HadoopJarStepConfig sparkStepConf = new HadoopJarStepConfig()
  .withJar("command-runner.jar")
  .withArgs(params);

final StepConfig sparkStep = new StepConfig()
  .withName("Spark Step")
  .withActionOnFailure("CONTINUE")
  .withHadoopJarStep(sparkStepConf);

AddJobFlowStepsRequest request = new AddJobFlowStepsRequest(clusterId)
  .withSteps(Collections.singletonList(sparkStep));

AddJobFlowStepsResult result = emr.addJobFlowSteps(request);
return result.getStepIds().get(0);
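The snippet above adds a step to a cluster that is already running. To get the full launch -> run -> terminate flow from a single program, you can instead attach the step when creating the cluster and set `keepJobFlowAliveWhenNoSteps` to false, so EMR shuts the cluster down once all steps finish. A sketch with the AWS SDK for Java v1 (the class name, jar path, bucket names, instance types, and EMR release label are placeholders to adapt):

```java
import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduce;
import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduceClientBuilder;
import com.amazonaws.services.elasticmapreduce.model.*;

public class LaunchAndRun {
    public static void main(String[] args) {
        AmazonElasticMapReduce emr = AmazonElasticMapReduceClientBuilder.defaultClient();

        // The Spark step, run through command-runner.jar as in the snippet above.
        StepConfig sparkStep = new StepConfig()
            .withName("Spark Step")
            .withActionOnFailure(ActionOnFailure.TERMINATE_CLUSTER)
            .withHadoopJarStep(new HadoopJarStepConfig()
                .withJar("command-runner.jar")
                .withArgs("spark-submit",
                          "--class", "com.example.MySparkApp",   // placeholder main class
                          "s3://my-bucket/my-spark-app.jar"));   // placeholder jar on S3

        RunJobFlowRequest request = new RunJobFlowRequest()
            .withName("Spark Cluster")
            .withReleaseLabel("emr-5.29.0")                      // pick your EMR release
            .withApplications(new Application().withName("Spark"))
            .withSteps(sparkStep)
            .withLogUri("s3://my-bucket/emr-logs/")              // placeholder log bucket
            .withServiceRole("EMR_DefaultRole")
            .withJobFlowRole("EMR_EC2_DefaultRole")
            .withInstances(new JobFlowInstancesConfig()
                .withInstanceCount(3)
                .withMasterInstanceType("m5.xlarge")
                .withSlaveInstanceType("m5.xlarge")
                // false => the cluster terminates once all steps have finished
                .withKeepJobFlowAliveWhenNoSteps(false));

        RunJobFlowResult result = emr.runJobFlow(request);
        System.out.println("Started cluster " + result.getJobFlowId());
    }
}
```

With `ActionOnFailure.TERMINATE_CLUSTER` on the step and `keepJobFlowAliveWhenNoSteps(false)`, the cluster goes away whether the job succeeds or fails, so nothing is left running.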

If you are looking just for automation, you should read about pipeline orchestration:

  • EMR is the AWS service that allows you to run distributed applications
  • AWS Data Pipeline is an orchestration tool that allows you to run jobs (activities) on resources (EMR or even EC2)

If you'd just like to run a Spark job consistently, I would suggest creating a data pipeline and configuring it with a single step: running the Scala Spark jar on the master node using a "ShellCommandActivity". Another benefit is that the jar you are running can be stored in AWS S3 (the object storage service); you just provide the S3 path to your Data Pipeline, and it will pick up that jar, log onto the EMR cluster it has brought up (with the configurations you've provided), copy the jar onto the master node, and run it with the configuration provided in the "ShellCommandActivity". Once the job exits (successfully or with an error), the pipeline kills the EMR cluster so you aren't paying for it, and logs the output.
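Defining such a pipeline can itself be done from the AWS SDK for Java, using `PipelineObject`s made of key/value `Field`s. A rough sketch, assuming an `EmrCluster` resource and a `ShellCommandActivity` that runs on it (ids, the release label, and the spark-submit command are placeholders; a complete definition would also need a `Default` object with a schedule type and roles):

```java
import com.amazonaws.services.datapipeline.DataPipeline;
import com.amazonaws.services.datapipeline.DataPipelineClientBuilder;
import com.amazonaws.services.datapipeline.model.*;
import java.util.Arrays;

public class DefinePipeline {
    public static void main(String[] args) {
        DataPipeline dp = DataPipelineClientBuilder.defaultClient();

        // Create an empty pipeline shell.
        String pipelineId = dp.createPipeline(new CreatePipelineRequest()
            .withName("spark-job-pipeline")
            .withUniqueId("spark-job-pipeline-1")).getPipelineId();

        // The EMR cluster the pipeline will bring up for the job.
        PipelineObject cluster = new PipelineObject()
            .withId("EmrClusterForJob").withName("EmrClusterForJob")
            .withFields(
                new Field().withKey("type").withStringValue("EmrCluster"),
                new Field().withKey("releaseLabel").withStringValue("emr-5.29.0"),
                new Field().withKey("terminateAfter").withStringValue("2 Hours"));

        // The ShellCommandActivity that runs spark-submit on the master node.
        PipelineObject activity = new PipelineObject()
            .withId("RunSparkJar").withName("RunSparkJar")
            .withFields(
                new Field().withKey("type").withStringValue("ShellCommandActivity"),
                new Field().withKey("runsOn").withRefValue("EmrClusterForJob"),
                new Field().withKey("command").withStringValue(
                    "spark-submit --class com.example.MySparkApp s3://my-bucket/my-spark-app.jar"));

        dp.putPipelineDefinition(new PutPipelineDefinitionRequest()
            .withPipelineId(pipelineId)
            .withPipelineObjects(Arrays.asList(cluster, activity)));
    }
}
```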

Please read more into it: https://aws.amazon.com/datapipeline/ & https://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/what-is-datapipeline.html

And if you'd like, you can trigger this pipeline via the AWS SDK, or even set the pipeline to run on a schedule.
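Triggering an existing pipeline from code is a one-call sketch with the AWS SDK for Java v1 (`df-0123456789ABCDEF` is a placeholder pipeline id):

```java
import com.amazonaws.services.datapipeline.DataPipeline;
import com.amazonaws.services.datapipeline.DataPipelineClientBuilder;
import com.amazonaws.services.datapipeline.model.ActivatePipelineRequest;

public class TriggerPipeline {
    public static void main(String[] args) {
        DataPipeline dp = DataPipelineClientBuilder.defaultClient();

        // Kick off a run of the already-defined pipeline.
        dp.activatePipeline(new ActivatePipelineRequest()
            .withPipelineId("df-0123456789ABCDEF"));   // placeholder id
    }
}
```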