I am currently running a quantitative data processing pipeline on AWS using RDS and EC2 instances. One portion of the pipeline requires significant computing power but is neither mission-critical nor time-critical, so I would like to use a cluster of EC2 spot instances for that stage.

I have been considering AWS Data Pipeline to architect the pipeline, but I am unsure how to integrate the spot instances. The AWS documentation suggests that Data Pipeline can use spot instances inside an EMR cluster, but not outside of one. I am looking for suggestions or best practices.

1 Answer

Spot instances can be used for both EC2 and EMR resources in Data Pipeline.

For an EC2 instance, set the `spotBidPrice` field on the resource. The pipeline definition for the EC2 resource should look like this:

    {
      "id": "EC2Instance",
      "type": "Ec2Resource",
      "terminateAfter": "1 Hour",
      "spotBidPrice": "<my bid price from 0 to 20.0>"
    }
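
To show how work ends up on the spot instance, here is a minimal sketch of an activity that points at the resource above through `runsOn`. The `id` value `ProcessBatch` and the command placeholder are illustrative assumptions, not part of the original setup, and a real pipeline will usually also need a schedule and a default object.

    {
      "id": "ProcessBatch",
      "type": "ShellCommandActivity",
      "runsOn": { "ref": "EC2Instance" },
      "command": "<my processing command>"
    }

Since the activity references the resource only by `id`, switching between on-demand and spot should only require adding or removing `spotBidPrice` on the `Ec2Resource`.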

For an EMR cluster, set the `taskInstanceBidPrice` field on the resource, which sets the bid for the cluster's task instances. The pipeline definition for the EMR resource should look like this:

    {
      "id" : "MyEmrCluster",
      "type" : "EmrCluster",
      "taskInstanceBidPrice": "<my bid price from 0 to 20.0>",
      "keyPair" : "my-key-pair",
      "masterInstanceType" : "m3.xlarge",
      "coreInstanceType" : "m3.xlarge",
      "coreInstanceCount" : "10",
      "taskInstanceType" : "m3.xlarge",
      "taskInstanceCount": "10",
      "releaseLabel": "emr-4.1.0",
      "applications": ["spark", "hive", "pig"],
      "configuration": {"ref":"myConfiguration"}
    }