I am new to AWS and am trying to port a Python-based image-processing application to the cloud. Our application scenario is similar to the Batch Processing reference architecture described here: media.amazonwebservices.com/architecturecenter/AWS_ac_ra_batch_03.pdf

Specifically the steps involved are:

  1. Receive a large number of images (>1000) and one CSV file containing image metadata
  2. Parse the CSV file and create a database (using DynamoDB).
  3. Push images to the cloud (using S3), and push messages of the form (bucketname, keyname) to an input queue (using SQS).
  4. "Pop" messages from the input queue
  5. Fetch the appropriate image data from S3, and metadata from DynamoDB.
  6. Do the processing
  7. Update the corresponding entry for that image in DynamoDB
  8. Save the results to S3
  9. Save a message in an output queue (SQS) which feeds the next part of the pipeline.

Steps 4-9 would involve the use of EC2 instances.
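
To give an idea of the producer side, this is roughly what I have so far for steps 2-3 with boto (the bucket, queue and table names, and the 'filename' CSV column, are just placeholders):

    import csv
    import json
    import boto
    from boto.s3.key import Key
    from boto.sqs.message import Message
    from boto.dynamodb2.table import Table

    BUCKET_NAME = 'my-image-bucket'     # placeholder resource names
    QUEUE_NAME = 'image-input-queue'
    TABLE_NAME = 'image-metadata'

    s3 = boto.connect_s3()
    sqs = boto.connect_sqs()
    bucket = s3.get_bucket(BUCKET_NAME)
    queue = sqs.get_queue(QUEUE_NAME)
    table = Table(TABLE_NAME)

    with open('metadata.csv') as f:
        for row in csv.DictReader(f):
            # Step 2: one DynamoDB item per image (the CSV is assumed to
            # have a 'filename' column that is also the table's hash key)
            table.put_item(data=row)

            # Step 3: upload the image to S3 ...
            key = Key(bucket)
            key.key = row['filename']
            key.set_contents_from_filename(row['filename'])

            # ... and push a (bucketname, keyname) message to the input queue
            msg = Message()
            msg.set_body(json.dumps({'bucketname': BUCKET_NAME,
                                     'keyname': row['filename']}))
            queue.write(msg)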

From the boto documentation and tutorials online, I have understood how to incorporate S3, SQS, and DynamoDB into the pipeline. However, I am unclear on how exactly to proceed with the EC2 part. I looked at some example implementations online, but couldn't figure out what the EC2 machines should do to make our batch image-processing application work. The two approaches I have found are:

  1. Use a BOOTSTRAP_SCRIPT with an infinite loop that constantly polls the input queue and processes messages when available. This is what I think is being done in the Django-PDF example on the AWS blog: http://aws.amazon.com/articles/Python/3998
  2. Use boto.services to take care of all the details of reading messages, retrieving and storing files in S3, writing messages, etc. This is what is used in the Monster Muck Mashup example: http://aws.amazon.com/articles/Python/691

Which of the above methods is preferred for batch-processing applications, or is there a better way? Also, for each of the above, how do I incorporate an Auto Scaling group to manage the EC2 machines based on the load in the input queue? Any help in this regard would be really appreciated. Thank you.

1 Answer

You should write an application (using Python and Boto, for example) that does the SQS polling and interacts with S3 and DynamoDB.
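
A minimal sketch of such a worker, assuming boto, placeholder bucket/queue/table names, and 'filename' as the DynamoDB hash key:

    import json
    import boto
    from boto.s3.key import Key
    from boto.sqs.message import Message
    from boto.dynamodb2.table import Table

    INPUT_QUEUE = 'image-input-queue'      # placeholder resource names
    OUTPUT_QUEUE = 'image-output-queue'
    RESULT_BUCKET = 'my-result-bucket'
    TABLE_NAME = 'image-metadata'

    sqs = boto.connect_sqs()
    s3 = boto.connect_s3()
    in_queue = sqs.get_queue(INPUT_QUEUE)
    out_queue = sqs.get_queue(OUTPUT_QUEUE)
    results = s3.get_bucket(RESULT_BUCKET)
    table = Table(TABLE_NAME)

    def process_image(data):
        """Your actual image processing goes here."""
        return data

    while True:
        # Step 4: long-poll the input queue (up to 20 s per call)
        msg = in_queue.read(wait_time_seconds=20)
        if msg is None:
            continue
        body = json.loads(msg.get_body())

        # Step 5: fetch the image from S3 and its metadata from DynamoDB
        src = s3.get_bucket(body['bucketname']).get_key(body['keyname'])
        image_data = src.get_contents_as_string()
        item = table.get_item(filename=body['keyname'])

        # Step 6: do the processing
        result = process_image(image_data)

        # Step 7: update this image's entry in DynamoDB
        item['status'] = 'processed'
        item.save()

        # Step 8: save the result to S3
        out_key = Key(results)
        out_key.key = 'results/' + body['keyname']
        out_key.set_contents_from_string(result)

        # Step 9: tell the next stage of the pipeline about the result
        out_msg = Message()
        out_msg.set_body(json.dumps({'bucketname': RESULT_BUCKET,
                                     'keyname': out_key.key}))
        out_queue.write(out_msg)

        # Delete the input message only after everything succeeded, so a
        # crashed worker lets the message reappear for another instance
        in_queue.delete_message(msg)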

This application must be installed at boot time on the EC2 instance. Several options are available (CloudFormation, Chef, cloud-init with user-data, or a custom AMI), but I would suggest you start with user-data, as described here: http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/user-data.html
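
For example, a minimal user-data sketch passed through boto when launching the instance (the AMI id, region, deploy bucket and worker script name are placeholders; it assumes an Amazon Linux AMI and that the worker script has been uploaded to S3 beforehand):

    import boto.ec2

    # Shell script passed as user-data; it runs as root on first boot.
    # It installs boto and the AWS CLI, pulls the worker script from S3
    # and starts it. Bucket and script names are placeholders.
    USER_DATA = """#!/bin/bash
    yum install -y python-pip
    pip install boto awscli
    aws s3 cp s3://my-deploy-bucket/worker.py /home/ec2-user/worker.py
    python /home/ec2-user/worker.py &
    """

    conn = boto.ec2.connect_to_region('us-east-1')
    conn.run_instances(
        'ami-xxxxxxxx',                               # placeholder AMI id
        instance_type='t1.micro',
        user_data=USER_DATA,
        instance_profile_name='image-worker-profile'  # IAM role, see below
    )

The same user-data can also go into an Auto Scaling launch configuration, which is how you would address the scaling part of your question.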

You must also ensure your instances have the proper privileges to talk to S3, SQS and DynamoDB. Create an IAM policy with these permissions, attach the policy to a role, and attach the role to your instances. The detailed procedure is available in the docs at http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/iam-roles-for-amazon-ec2.html
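
A sketch of doing that with boto (role, profile and policy names are placeholders; the blanket "*" permissions are only for illustration and should be scoped down to your actual resources):

    import json
    import boto

    iam = boto.connect_iam()

    # Trust policy letting EC2 instances assume the role
    assume_role = json.dumps({
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"Service": "ec2.amazonaws.com"},
            "Action": "sts:AssumeRole"
        }]
    })

    # Permissions the workers need (deliberately broad for the sketch)
    policy = json.dumps({
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": ["s3:*", "sqs:*", "dynamodb:*"],
            "Resource": "*"
        }]
    })

    iam.create_role('image-worker-role', assume_role_policy_document=assume_role)
    iam.put_role_policy('image-worker-role', 'image-worker-policy', policy)

    # The instance profile is what actually gets attached to the instance
    iam.create_instance_profile('image-worker-profile')
    iam.add_role_to_instance_profile('image-worker-profile', 'image-worker-role')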