I'm looking at migrating a massive database to Amazon's DynamoDB (think 150 million plus records).
I'm currently storing these records in Elasticsearch.
I'm reading up on AWS Data Pipeline, which can import into DynamoDB from S3 using a TSV, CSV, or JSON file.
It seems the best way to go is a JSON file, and I've found two examples of how it should be structured:
From AWS:
{"Name"ETX {"S":"Amazon DynamoDB"}STX"Category"ETX {"S":"Amazon Web Services"}} {"Name"ETX {"S":"Amazon push"}STX"Category"ETX {"S":"Amazon Web Services"}} {"Name"ETX {"S":"Amazon S3"}STX"Category"ETX {"S":"Amazon Web Services"}}
And the second example:

    {"Name": {"S":"Amazon DynamoDB"},"Category": {"S":"Amazon Web Services"}}
    {"Name": {"S":"Amazon push"},"Category": {"S":"Amazon Web Services"}}
    {"Name": {"S":"Amazon S3"},"Category": {"S":"Amazon Web Services"}}
So, my questions are the following:
- Do I literally have to put the STX (start-of-text) and ETX (end-of-text) control characters in the file, as in the first example?
- How reliable is this method? Should I be concerned about failed uploads? There doesn't seem to be a way to do error handling, so do I just assume that AWS got it right?
- Is there an ideal file size? For example, should I break the database into chunks of, say, 100K records and store each chunk in its own file? (See the sketch after this list.)
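If chunking is the way to go, this is the kind of split I had in mind before uploading each piece to S3 (again just a sketch; the 100K figure and the file names are placeholders):

    CHUNK_SIZE = 100_000  # items (lines) per output file

    # Split the newline-delimited JSON produced above into
    # CHUNK_SIZE-item pieces, one file per piece.
    with open("items.json") as src:
        out, part = None, 0
        for i, line in enumerate(src):
            if i % CHUNK_SIZE == 0:
                if out:
                    out.close()
                out = open(f"items-part-{part:05d}.json", "w")
                part += 1
            out.write(line)
        if out:
            out.close()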
I want to get this right the first time and not incur extra charges, since apparently you get charged whether your setup is right or wrong.
Links to any specific parts of the manual that I've missed would also be greatly appreciated.