
I have a requirement to read a CSV batch file that was uploaded to an S3 bucket, encrypt the data in some of its columns, and persist the data in a DynamoDB table. While persisting each row in the DynamoDB table, I also need to generate an ID based on the data in that row and store it in the table as well. AWS Data Pipeline seems to allow creating a job to import S3 bucket files into DynamoDB, but I can't find a way to add custom logic there to encrypt some of the column values in the file, or to generate the ID mentioned above.

Is there any way I can achieve this requirement using AWS Data Pipeline? If not, what would be the best approach I could follow using AWS services?
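To make the "custom logic" part concrete, the per-row transformation I have in mind would look roughly like this (just a sketch; using KMS for the column encryption is only one option, and the key alias, column names and ID rule below are placeholders):

```python
import base64
import hashlib

import boto3

kms = boto3.client("kms")
KMS_KEY_ID = "alias/my-batch-key"          # hypothetical KMS key alias
COLUMNS_TO_ENCRYPT = ["ssn", "email"]      # hypothetical column names

def transform_row(row):
    """row is a dict parsed from one CSV line."""
    # Encrypt the sensitive columns and store them base64-encoded.
    for col in COLUMNS_TO_ENCRYPT:
        resp = kms.encrypt(KeyId=KMS_KEY_ID, Plaintext=row[col].encode())
        row[col] = base64.b64encode(resp["CiphertextBlob"]).decode()
    # Example ID rule: a deterministic hash over a couple of the row's fields
    # (field names are made up for illustration).
    row["id"] = hashlib.sha256(
        (row["customer_no"] + row["order_date"]).encode()
    ).hexdigest()
    return row
```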


1 Answer


We also have a situation where we need to fetch data from S3 and populate it into DynamoDB after performing some transformations (business logic).

We also use AWS Data Pipeline for this process.

From Data Pipeline we first trigger an EMR cluster, which fetches the data from S3, transforms it, and populates DynamoDB (DDB). You can include all the custom logic you require in the step that runs on the EMR cluster, as sketched below.
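Stripped down to the S3 read and the DynamoDB write, the step script looks roughly like this (a sketch only; the bucket, key and table names are placeholders, and your encryption / ID-generation logic goes into transform_row):

```python
import csv
import io

import boto3

s3 = boto3.client("s3")
table = boto3.resource("dynamodb").Table("my-target-table")  # placeholder table name

def transform_row(row):
    # ... encrypt the required columns, generate the ID, etc. ...
    return row

def run(bucket="my-upload-bucket", key="batch/input.csv"):  # placeholder locations
    # Read the uploaded CSV from S3 and parse it into dicts.
    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")
    reader = csv.DictReader(io.StringIO(body))
    # Apply the transformation and write each row to DynamoDB in batches.
    with table.batch_writer() as writer:
        for row in reader:
            writer.put_item(Item=transform_row(row))

if __name__ == "__main__":
    run()
```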

We have a schedule set in the pipeline that triggers the EMR cluster once a day to perform this task.
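For reference, the daily trigger can also be set up programmatically; the sketch below is only illustrative and assumes the default Data Pipeline roles, placeholder names, and a placeholder spark-submit step:

```python
from datetime import datetime, timedelta

import boto3

dp = boto3.client("datapipeline")

# Create the pipeline shell.
pipeline_id = dp.create_pipeline(
    name="s3-to-dynamodb-daily", uniqueId="s3-to-dynamodb-daily"
)["pipelineId"]

start = (datetime.utcnow() + timedelta(minutes=15)).strftime("%Y-%m-%dT%H:%M:%S")

# Define a daily schedule, an EMR cluster, and the activity that runs the step.
dp.put_pipeline_definition(
    pipelineId=pipeline_id,
    pipelineObjects=[
        {"id": "Default", "name": "Default", "fields": [
            {"key": "scheduleType", "stringValue": "cron"},
            {"key": "schedule", "refValue": "DailySchedule"},
            {"key": "resourceRole", "stringValue": "DataPipelineDefaultResourceRole"},
            {"key": "role", "stringValue": "DataPipelineDefaultRole"},
        ]},
        {"id": "DailySchedule", "name": "DailySchedule", "fields": [
            {"key": "type", "stringValue": "Schedule"},
            {"key": "period", "stringValue": "1 days"},
            {"key": "startDateTime", "stringValue": start},
        ]},
        {"id": "EmrClusterForLoad", "name": "EmrClusterForLoad", "fields": [
            {"key": "type", "stringValue": "EmrCluster"},
            {"key": "releaseLabel", "stringValue": "emr-5.36.0"},
        ]},
        {"id": "LoadActivity", "name": "LoadActivity", "fields": [
            {"key": "type", "stringValue": "EmrActivity"},
            {"key": "runsOn", "refValue": "EmrClusterForLoad"},
            # Placeholder step: run the transform script stored in S3.
            {"key": "step", "stringValue": "command-runner.jar,spark-submit,s3://my-bucket/scripts/load.py"},
        ]},
    ],
)

dp.activate_pipeline(pipelineId=pipeline_id)
```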

Note that this approach can incur additional costs too (for the EMR cluster).