I would like to use AWS Data Pipeline to execute an ETL process. My process has a small input file, and I would like to use a custom JAR or Python script to perform the data transformations. I don't see any reason to spin up an EMR cluster for such a simple step, so I would like to execute it on a single EC2 instance instead.
Looking at the AWS Data Pipeline EmrActivity object, I only see the option to run on an EMR cluster. Is there a way to run a computation step on a plain EC2 instance? Is that the best solution for this use case? Or is it better to set up a small single-node EMR cluster and execute a Hadoop job?
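For context, what I would hope to define is something like the sketch below, assuming an activity such as ShellCommandActivity can be attached to a plain Ec2Resource (all ids, the instance type, and the script path are placeholders I made up):

```json
{
  "objects": [
    {
      "id": "MyEc2Resource",
      "type": "Ec2Resource",
      "instanceType": "t2.micro",
      "terminateAfter": "1 Hour"
    },
    {
      "id": "TransformStep",
      "type": "ShellCommandActivity",
      "runsOn": { "ref": "MyEc2Resource" },
      "command": "python /home/ec2-user/transform.py"
    }
  ]
}
```

That is, a single activity that runs my transformation script on one instance, with no cluster involved.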