Can someone please help? I'm trying to do exactly this: I cannot create an EMR cluster with Spark installed from a Data Pipeline configuration in the AWS console. When I choose 'Run job on an Elastic MapReduce cluster', the EMR cluster is always created with Pig and Hive by default, not Spark.
I understand that I can install Spark via a bootstrap action, as described here, but when I do I get an error (shown below the configuration). This is my setup:
Name: xxx.xxxxxxx.processing.dp
Build using a template: Run job on an Elastic MapReduce cluster
Parameters:
EC2 key pair (optional): xxx_xxxxxxx_emr_key
EMR step(s):
spark-submit --deploy-mode cluster s3://xxx.xxxxxxx.scripts.bucket/CSV2Parquet.py s3://xxx.xxxxxxx.scripts.bucket/
EMR Release Label: emr-4.3.0
Bootstrap action(s) (optional): s3://support.elasticmapreduce/spark/install-spark,-v,1.4.0.b
Where does the AMI version go? And does the above look correct?
Here's the error I get when I activate the data pipeline:
Unable to create resource for @EmrClusterObj_2017-01-13T09:00:07 due to: The supplied bootstrap action(s): 'bootstrap-action.6255c495-578a-441a-9d05-d03981fc460d' are not supported by release 'emr-4.3.0'. (Service: AmazonElasticMapReduce; Status Code: 400; Error Code: ValidationException; Request ID: b1b81565-d96e-11e6-bbd2-33fb57aa2526)
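In case it helps to show what I'm aiming for, here is a rough sketch of how I assume the EmrCluster object in the pipeline definition would look if Spark were requested through an applications field instead of a bootstrap action. The object id, instance types and terminateAfter value are placeholders I made up, and I'm not certain of the exact field name or casing, so please correct me:

{
  "id": "EmrClusterForSpark",
  "name": "EmrClusterForSpark",
  "type": "EmrCluster",
  "releaseLabel": "emr-4.3.0",
  "applications": ["spark"],
  "masterInstanceType": "m3.xlarge",
  "coreInstanceType": "m3.xlarge",
  "coreInstanceCount": "1",
  "keyPair": "xxx_xxxxxxx_emr_key",
  "terminateAfter": "2 Hours"
}

Even if that is roughly right, I'd still like to know whether it can be set from the 'Run job on an Elastic MapReduce cluster' console template rather than by editing the pipeline definition directly.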
If I specify a later EMR release label, do I get Spark installed by default?
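For comparison, when creating a cluster directly with the EMR CLI I believe Spark is something you opt into as an application alongside the release label, roughly like this (the cluster name and instance sizing below are placeholders), so I assume the pipeline definition needs an equivalent setting:

# Placeholder name and instance sizing; the relevant part is --applications Name=Spark
aws emr create-cluster \
  --name "spark-test-cluster" \
  --release-label emr-4.3.0 \
  --applications Name=Spark Name=Hive Name=Pig \
  --instance-type m3.xlarge \
  --instance-count 3 \
  --use-default-roles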
Many thanks for any help here. Regards.