I really appreciate the suggestions (Chris & Jarmod). After trying and failing to break things programmatically into smaller chunks, I decided to look at the approach in general.
Through research I found there were four options:
- Lambda function - as per the above, this fails with a timeout.
- AWS Data Pipeline - doesn't have a template for importing a CSV into DynamoDB.
- Manual entry - of 2.5m items? No thanks! :)
- Use an EC2 instance to load the data into RDS, then use DMS (Database Migration Service) to migrate it to DynamoDB.
The last option actually worked well. Here's what I did:
- Create an RDS MySQL database (I used the db.t2.micro tier as it's free-tier eligible) and create a blank table matching the CSV columns.
- Create an EC2 instance (free-tier Linux) and:
- Upload the CSV file to the instance with scp (run from your local machine).
- On the EC2 instance: first run `sudo yum install mysql` to get the client tools, then use `mysqlimport` with the `--local` option to import the CSV file into the RDS MySQL database - this took literally seconds to complete. (The shell sketch after this list covers these EC2 steps.)
- At this point I also did some data cleansing with standard SQL queries, to strip stray whitespace and carriage returns that had crept into the file (see the second sketch after this list).
- Using DMS, I created a replication instance, endpoints for the source (RDS) and target (DynamoDB) databases, and finally a task to run the import (the AWS CLI sketch after this list shows the equivalent commands).
- The import took around 4hr 30m
- After the import, I removed the EC2, RDS, and DMS objects (and associated IAM roles) to avoid any potential costs.
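
For anyone wanting the EC2 side as commands, here's a minimal sketch of those steps. The key pair, hostnames, database name (`mydb`), table name (`mytable`), and column definitions are placeholders for illustration, not my actual schema:

```bash
# From your local machine: copy the CSV up to the EC2 instance.
# mysqlimport derives the table name from the file's basename, so name
# the file after the target table (mytable.csv -> table "mytable").
scp -i my-key.pem mytable.csv ec2-user@<ec2-public-dns>:/home/ec2-user/

# On the EC2 instance: install the MySQL client tools
# (the package may be called mariadb on newer Amazon Linux AMIs)
sudo yum install -y mysql

# Create the blank table in the RDS database (columns are placeholders)
mysql --host=<rds-endpoint> --user=admin -p mydb \
  -e "CREATE TABLE mytable (id VARCHAR(64), col1 VARCHAR(255), col2 VARCHAR(255));"

# Import the CSV into RDS; drop --ignore-lines=1 if there is no header row
mysqlimport --local --host=<rds-endpoint> --user=admin -p \
  --fields-terminated-by=',' --lines-terminated-by='\n' --ignore-lines=1 \
  mydb /home/ec2-user/mytable.csv
```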
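The cleansing was just plain SQL run through the mysql client - something along these lines, with hypothetical column names:

```bash
# Trim leading/trailing spaces and strip carriage returns from each column
mysql --host=<rds-endpoint> --user=admin -p mydb -e "
  UPDATE mytable SET col1 = TRIM(REPLACE(col1, '\r', ''));
  UPDATE mytable SET col2 = TRIM(REPLACE(col2, '\r', ''));
"
```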
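I set the DMS pieces up through the console, but roughly the same thing as AWS CLI calls looks like this. It's a sketch only: identifiers, ARNs, and the IAM role that lets DMS write to DynamoDB are placeholders you'd need to create yourself:

```bash
# Replication instance to run the migration (smallest class to keep cost down)
aws dms create-replication-instance \
  --replication-instance-identifier csv-import \
  --replication-instance-class dms.t2.micro \
  --allocated-storage 20

# Source endpoint pointing at the RDS MySQL database
aws dms create-endpoint \
  --endpoint-identifier rds-source \
  --endpoint-type source \
  --engine-name mysql \
  --server-name <rds-endpoint> --port 3306 \
  --username admin --password <password> \
  --database-name mydb

# Target endpoint pointing at DynamoDB; DMS needs an IAM role it can assume
aws dms create-endpoint \
  --endpoint-identifier dynamodb-target \
  --endpoint-type target \
  --engine-name dynamodb \
  --service-access-role-arn arn:aws:iam::<account-id>:role/<dms-dynamodb-role>

# Table mapping: include just the one table imported into RDS
cat > table-mappings.json <<'EOF'
{
  "rules": [
    {
      "rule-type": "selection",
      "rule-id": "1",
      "rule-name": "include-mytable",
      "object-locator": { "schema-name": "mydb", "table-name": "mytable" },
      "rule-action": "include"
    }
  ]
}
EOF

# Full-load task from the source endpoint to the target endpoint
aws dms create-replication-task \
  --replication-task-identifier csv-to-dynamodb \
  --source-endpoint-arn <source-endpoint-arn> \
  --target-endpoint-arn <target-endpoint-arn> \
  --replication-instance-arn <replication-instance-arn> \
  --migration-type full-load \
  --table-mappings file://table-mappings.json

# Start the task once it reports "ready"
aws dms start-replication-task \
  --replication-task-arn <task-arn> \
  --start-replication-task-type start-replication
```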
Fortunately, I had a flat structure to work with, and it was only one table. I needed the cheap speed of DynamoDB; otherwise, I'd have stuck with RDS (I almost did halfway through the process!!!)
Thanks for reading, and best of luck if you have the same issue in the future.