2
votes

I am looking for the best tools available on AWS to schedule a task/job that will query an external HTTP server. The external server replies with XML files, so ideally the files would be stored on S3, then processed, and the polished data moved to Redshift. I was looking at AWS Data Pipeline and Amazon EMR, but they mostly focus on moving data within AWS. Any suggestions? Thanks.


2 Answers

0
votes

Amazon Simple Workflow Service (SWF) may be a solution. I'm sure SWF can do that, but it's a little heavy: you need more programming than with Data Pipeline.

Here is the difference between SWF and Data Pipeline:

Q: How is AWS Data Pipeline different from Amazon Simple Workflow Service?

While both services provide execution tracking, retry and exception-handling capabilities, and the ability to run arbitrary actions, AWS Data Pipeline is specifically designed to facilitate the specific steps that are common across a majority of data-driven workflows – in particular, executing activities after their input data meets specific readiness criteria, easily copying data between different data stores, and scheduling chained transforms. This highly specific focus means that its workflow definitions can be created rapidly and with no code or programming knowledge. Ref.

Or you can just use SWF to create a schedule, then put the processing logic in an AWS Lambda function. Using SWF to trigger the Lambda function will be simpler.
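A minimal sketch of what such a Lambda function could look like: it fetches the XML feed over HTTP and drops the raw response into S3 under a date-partitioned key, leaving the transform step to run later. The endpoint, bucket name, and key layout here are placeholders I made up, not anything from your setup.

```python
# Hypothetical Lambda handler: fetch an XML feed and land the raw bytes in S3.
# FEED_URL, BUCKET, and the key scheme are placeholder assumptions.
import urllib.request
from datetime import datetime, timezone

FEED_URL = "http://example.com/feed.xml"   # placeholder external endpoint
BUCKET = "my-xml-landing-bucket"           # placeholder S3 bucket

def build_key(now=None):
    """Date-partitioned S3 key for the raw XML, e.g. raw/2024/01/02/feed-030405.xml."""
    now = now or datetime.now(timezone.utc)
    return now.strftime("raw/%Y/%m/%d/feed-%H%M%S.xml")

def handler(event, context):
    import boto3  # boto3 ships with the Lambda Python runtime
    with urllib.request.urlopen(FEED_URL) as resp:
        body = resp.read()
    boto3.client("s3").put_object(Bucket=BUCKET, Key=build_key(), Body=body)
```

Landing the raw file first (instead of transforming in-flight) means a failed transform can be replayed from S3 without hitting the external server again.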

0
votes

If you are using AWS DataPipeline, you can write a ShellCommandActivity (python script or any cust exe) that can fetch the XML from the target server, munge it to CSV and persist it to s3, you could then use RedshiftCopyActivity to instruct Redshift to load the file from that location.