I am new to AWS and am trying to port a Python-based image processing application to the cloud. Our application scenario is similar to the Batch Processing scenario described here [media.amazonwebservices.com/architecturecenter/AWS_ac_ra_batch_03.pdf]
Specifically the steps involved are:
- Receive a large number of images (>1000) and one CSV file containing image metadata
- Parse the CSV file and create a database (using DynamoDB).
- Push the images to the cloud (using S3), and push a message of the form (bucketname, keyname) for each image to an input queue (using SQS).
- "Pop" messages from the input queue.
- Fetch the appropriate image data from S3, and the metadata from DynamoDB.
- Do the processing.
- Update the corresponding entry for that image in DynamoDB.
- Save the results to S3.
- Save a message in the output queue (SQS), which feeds the next part of the pipeline.
Steps 4-9 would involve the use of EC2 instances.
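To make the question concrete, this is roughly what I picture a single EC2 worker doing for steps 4-9. It is only a minimal sketch: the region, queue, bucket, and table names, the image_id hash key, and process_image are placeholders for our actual setup, and I am assuming the (bucketname, keyname) message is JSON-encoded.

```python
import json
import time

import boto.sqs
import boto.s3
import boto.dynamodb2
from boto.dynamodb2.table import Table
from boto.sqs.message import Message

REGION = 'us-east-1'               # placeholder
INPUT_QUEUE = 'image-input'        # placeholder
OUTPUT_QUEUE = 'image-output'      # placeholder
RESULT_BUCKET = 'image-results'    # placeholder
METADATA_TABLE = 'image-metadata'  # placeholder


def process_image(image_bytes, metadata_item):
    """Stand-in for our actual algorithm; returns result bytes and field updates."""
    return image_bytes, {'status': 'processed'}


sqs = boto.sqs.connect_to_region(REGION)
s3 = boto.s3.connect_to_region(REGION)
in_q = sqs.get_queue(INPUT_QUEUE)
out_q = sqs.get_queue(OUTPUT_QUEUE)
table = Table(METADATA_TABLE,
              connection=boto.dynamodb2.connect_to_region(REGION))

while True:
    msgs = in_q.get_messages(num_messages=1)        # step 4: "pop" a message
    if not msgs:
        time.sleep(5)                               # nothing to do yet
        continue
    msg = msgs[0]
    body = json.loads(msg.get_body())               # {"bucket": ..., "key": ...}

    image = s3.get_bucket(body['bucket']).get_key(body['key'])
    image_bytes = image.get_contents_as_string()    # step 5: image data from S3
    item = table.get_item(image_id=body['key'])     # step 5: metadata (hash key name is a guess)

    result_bytes, updates = process_image(image_bytes, item)   # step 6

    for field, value in updates.items():            # step 7: update DynamoDB entry
        item[field] = value
    item.save()

    result_key = s3.get_bucket(RESULT_BUCKET).new_key('results/' + body['key'])
    result_key.set_contents_from_string(result_bytes)          # step 8: results to S3

    out_msg = Message()                             # step 9: feed the next stage
    out_msg.set_body(json.dumps({'bucket': RESULT_BUCKET, 'key': result_key.name}))
    out_q.write(out_msg)

    in_q.delete_message(msg)   # delete only after success, so failures get retried
```

The message is only deleted after the result has been written, so if a worker dies mid-task the message should become visible again after the visibility timeout and be retried.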
From the boto documentation and tutorials online, I have understood how to incorporate S3, SQS, and DynamoDB into the pipeline. However, I am unclear on how exactly to proceed with the EC2 part. I looked at some example implementations online, but couldn't figure out what the EC2 machine should do to make our batch image processing application work. The two approaches I have come across are:
- Use a BOOTSTRAP_SCRIPT with an infinite loop that constantly polls the input queue and processes messages whenever they are available. This is what I think is being done in the Django-PDF example on the AWS blog: http://aws.amazon.com/articles/Python/3998
- Use boto.services to take care of all the details of reading messages, retrieving and storing files in S3, writing result messages, etc. This is what is used in the Monster Muck Mash-up example: http://aws.amazon.com/articles/Python/691
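For the first option, my understanding is that the bootstrap script would just start that worker loop at boot, for example passed in as user data when the instances are launched. A sketch of what I mean (the AMI ID, key pair, instance type, and paths are placeholders):

```python
import boto.ec2

REGION = 'us-east-1'       # placeholder
AMI_ID = 'ami-xxxxxxxx'    # placeholder AMI with our code and boto pre-installed
KEY_NAME = 'my-keypair'    # placeholder

# Runs at first boot: just starts the polling worker sketched above.
BOOTSTRAP_SCRIPT = """#!/bin/bash
cd /home/ubuntu/pipeline            # placeholder path to our application
python worker.py >> /var/log/worker.log 2>&1 &
"""

ec2 = boto.ec2.connect_to_region(REGION)
reservation = ec2.run_instances(
    AMI_ID,
    min_count=1,
    max_count=4,                    # number of workers to start
    key_name=KEY_NAME,
    instance_type='c1.medium',      # placeholder
    user_data=BOOTSTRAP_SCRIPT,
)
print([i.id for i in reservation.instances])
```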
Which of the above methods is preferred for batch processing applications, or is there a better way? Also, for each of the above, how do I incorporate an Auto Scaling group to manage the EC2 instances based on the load in the input queue? A rough sketch of what I have in mind for the scaling part is at the end of this post. Any help in this regard would be really appreciated. Thank you.
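To clarify what I mean by scaling on queue load, here is my rough attempt, pieced together from the boto autoscale and CloudWatch documentation (the group, policy, and queue names, the AMI ID, the user data, and the thresholds are all made-up placeholders, and I am not sure this is the right way to wire it together):

```python
import boto.ec2.autoscale
import boto.ec2.cloudwatch
from boto.ec2.autoscale import LaunchConfiguration, AutoScalingGroup, ScalingPolicy
from boto.ec2.cloudwatch import MetricAlarm

REGION = 'us-east-1'        # placeholder
QUEUE_NAME = 'image-input'  # placeholder: the input queue to watch
USER_DATA = "#!/bin/bash\npython /home/ubuntu/pipeline/worker.py\n"  # same bootstrap idea as above

asc = boto.ec2.autoscale.connect_to_region(REGION)
cw = boto.ec2.cloudwatch.connect_to_region(REGION)

# Launch configuration: what each new worker instance looks like.
lc = LaunchConfiguration(name='worker-lc', image_id='ami-xxxxxxxx',
                         key_name='my-keypair', instance_type='c1.medium',
                         user_data=USER_DATA)
asc.create_launch_configuration(lc)

# The group itself; scaling happens between min_size and max_size.
group = AutoScalingGroup(group_name='worker-group',
                         availability_zones=['us-east-1a'],
                         launch_config=lc, min_size=0, max_size=10)
asc.create_auto_scaling_group(group)

# One policy to add a worker, one to remove a worker.
asc.create_scaling_policy(ScalingPolicy(
    name='scale-up', as_name='worker-group',
    adjustment_type='ChangeInCapacity', scaling_adjustment=1, cooldown=300))
asc.create_scaling_policy(ScalingPolicy(
    name='scale-down', as_name='worker-group',
    adjustment_type='ChangeInCapacity', scaling_adjustment=-1, cooldown=300))
scale_up = asc.get_all_policies(as_group='worker-group', policy_names=['scale-up'])[0]
scale_down = asc.get_all_policies(as_group='worker-group', policy_names=['scale-down'])[0]

# CloudWatch alarms on the queue backlog trigger the policies.
cw.create_alarm(MetricAlarm(
    name='input-queue-deep', namespace='AWS/SQS',
    metric='ApproximateNumberOfMessagesVisible', statistic='Average',
    comparison='>', threshold=100, period=300, evaluation_periods=1,
    alarm_actions=[scale_up.policy_arn],
    dimensions={'QueueName': QUEUE_NAME}))
cw.create_alarm(MetricAlarm(
    name='input-queue-empty', namespace='AWS/SQS',
    metric='ApproximateNumberOfMessagesVisible', statistic='Average',
    comparison='<', threshold=10, period=300, evaluation_periods=2,
    alarm_actions=[scale_down.policy_arn],
    dimensions={'QueueName': QUEUE_NAME}))
```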