1 vote

Please can someone help? I'm trying to do exactly this: I cannot create an EMR cluster with Spark installed from a Data Pipeline configuration in the AWS console. When I choose 'Run job on an EMR cluster', the cluster is always created with Pig and Hive by default, not Spark.

I understand that I can install Spark via a bootstrap action, as described here, but when I do I get the error shown further down. This is my configuration:

Name: xxx.xxxxxxx.processing.dp
Build using a template: Run job on an Elastic MapReduce cluster

Parameters:
EC2 key pair (optional): xxx_xxxxxxx_emr_key
EMR step(s): spark-submit --deploy-mode cluster s3://xxx.xxxxxxx.scripts.bucket/CSV2Parquet.py s3://xxx.xxxxxxx.scripts.bucket/
EMR Release Label: emr-4.3.0
Bootstrap action(s) (optional): s3://support.elasticmapreduce/spark/install-spark,-v,1.4.0.b

Where does the AMI version go? And does the above look correct?

Here's the error I get when I activate the data pipeline:

Unable to create resource for @EmrClusterObj_2017-01-13T09:00:07 due to: The supplied bootstrap action(s): 'bootstrap-action.6255c495-578a-441a-9d05-d03981fc460d' are not supported by release 'emr-4.3.0'. (Service: AmazonElasticMapReduce; Status Code: 400; Error Code: ValidationException; Request ID: b1b81565-d96e-11e6-bbd2-33fb57aa2526)

If I specify a later EMR release, will Spark be installed by default?

Many thanks for any help here. Regards.


1 Answer

3 votes

That install-spark bootstrap action is only for 3.x AMI versions. If you are using a releaseLabel (emr-4.x or beyond), the applications to install are specified in a different way.
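To illustrate the difference, here is a rough, untested sketch of how the two styles look in a pipeline definition file, using the EmrCluster fields as I understand them from the Data Pipeline documentation (amiVersion plus bootstrapAction for 3.x AMIs, versus releaseLabel plus applications for 4.x and later); the AMI version and object names below are only example values:

{
  "objects": [
    {
      "id": "EmrCluster3x",
      "type": "EmrCluster",
      "name": "Spark installed via bootstrap action (3.x AMIs only)",
      "amiVersion": "3.9.0",
      "bootstrapAction": "s3://support.elasticmapreduce/spark/install-spark,-v,1.4.0.b"
    },
    {
      "id": "EmrCluster4x",
      "type": "EmrCluster",
      "name": "Spark listed as an application (release labels, emr-4.x and later)",
      "releaseLabel": "emr-4.3.0",
      "applications": ["spark"]
    }
  ]
}

The point is that with a release label the bootstrap action goes away entirely, which is exactly what the ValidationException in your error is complaining about.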

I myself have never used Data Pipeline, but I see that when you are creating a pipeline, if you click "Edit in Architect" at the bottom, you can then click the EmrCluster node and select Applications from the "Add an optional field..." dropdown. That is where you can add Spark.
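If it helps, after adding that field and exporting the definition, I would expect the EmrCluster node to end up looking roughly like this (again an untested sketch; the release label and key pair are the ones from your question, and you should double-check the exact spelling of the application name against what Architect writes out):

{
  "id": "EmrClusterObj",
  "type": "EmrCluster",
  "name": "EmrClusterObj",
  "releaseLabel": "emr-4.3.0",
  "applications": ["spark"],
  "keyPair": "xxx_xxxxxxx_emr_key"
}

You may also need to write your EMR step in Data Pipeline's comma-separated form, something like command-runner.jar,spark-submit,--deploy-mode,cluster,s3://xxx.xxxxxxx.scripts.bucket/CSV2Parquet.py,s3://xxx.xxxxxxx.scripts.bucket/, but I have not verified that part.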