1 vote

The AWS docs for importing data from S3 into a DynamoDB table using Data Pipeline (https://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-importexport-ddb-part1.html) reference an S3 file (s3://elasticmapreduce/samples/Store/ProductCatalog) which is in this format:

[screenshot of the sample file contents]

https://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-importexport-ddb-pipelinejson-verifydata2.html

Question is... how do I get a CSV of, say, 4 million rows into this format in the first place? Is there a utility for that?

Thanks for any suggestions... I've had a good google and haven't turned up anything.

Perhaps the intent is always to export data from Dynamo to S3 first (back it up), and then you can always import that backup... and thus you've got the file in the right format. But that doesn't cover an initial import into Dynamo, which is the workflow I'm trying to achieve. – Andrew Duffy
I did this once via a custom pipeline job. Not posting as an answer as I don't have the link or a copy of exactly what I used. It was something like this though: github.com/awslabs/data-pipeline-samples/blob/master/samples/… – stevepkr84

2 Answers

1 vote

stevepkr84 already linked to this in his comment, but I wanted to call it out: https://github.com/awslabs/data-pipeline-samples/tree/master/samples/DynamoDBImportCSV

Hive on EMR supports DynamoDB as an external table type. This sample uses a HiveActivity to create external Hive tables pointing at the target DynamoDB table and at the source CSV, then executes a Hive query to copy the data from one to the other, roughly along the lines of the sketch below.
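Here's a minimal sketch of that idea, not the exact script from the sample: the column names, the S3 path, and the DynamoDB table and attribute names are placeholders you'd swap for your own.

    -- External table over the source CSV sitting in S3 (path and schema are hypothetical)
    CREATE EXTERNAL TABLE csv_source (
      id    string,
      name  string,
      price string
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    LOCATION 's3://my-bucket/csv-input/';

    -- External table backed by the target DynamoDB table via EMR's storage handler
    CREATE EXTERNAL TABLE ddb_target (
      id    string,
      name  string,
      price string
    )
    STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
    TBLPROPERTIES (
      "dynamodb.table.name"     = "ProductCatalog",
      "dynamodb.column.mapping" = "id:Id,name:Name,price:Price"
    );

    -- Copy every CSV row into DynamoDB
    INSERT OVERWRITE TABLE ddb_target
    SELECT id, name, price FROM csv_source;

The nice side effect is that Hive does the format conversion for you, so you never have to hand-build the export-format file shown in the question.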

-4 votes

The AWS Data Pipeline service supports CSV import to DynamoDB. You can create a pipeline from the AWS console for Data Pipeline and choose "Import DynamoDB backup data from S3" to import a CSV stored in S3 into DynamoDB.

See also

http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/DynamoDBPipeline.html#DataPipelineExportImport.Importing