
I want to import a large CSV file (around 1 GB, with 2.5m rows and 50 columns) into DynamoDB, so I have been following this blog from AWS.

However, it seems I'm up against a timeout issue: it gets to ~600,000 rows ingested and then falls over.

From reading the CloudWatch logs, I think the timeout is occurring because of the boto3 read of the CSV file (it opens the entire file first, then iterates through it and batches up the writes). I tried reducing the file size (3 columns, 10,000 rows as a test), and I still got a timeout, after 2,500 rows.
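
For context, the handler follows roughly this pattern (a simplified sketch of the blog's approach as I understand it, not its exact code; the bucket, key and table names are placeholders):

    import csv
    import io

    import boto3

    s3 = boto3.client("s3")
    table = boto3.resource("dynamodb").Table("my-table")  # placeholder table name

    def lambda_handler(event, context):
        # Placeholder bucket/key for wherever the CSV lives in S3
        obj = s3.get_object(Bucket="my-bucket", Key="data.csv")

        # The whole file is read into memory up front, then iterated over
        rows = csv.DictReader(io.StringIO(obj["Body"].read().decode("utf-8")))

        # batch_writer() buffers the puts into 25-item BatchWriteItem calls
        with table.batch_writer() as batch:
            for row in rows:
                batch.put_item(Item=row)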

Any thoughts here?!

TIA :)

Have you checked you're not being throttled by DynamoDB? This will be available as metrics in CloudWatch. – Chris Williams
Thanks Chris, I checked this out, but no throttling is occurring (0 throttles counted). The read/write capacity units peaked at around 400, I believe, using an on-demand table. – Simon Pearson
When you say it times out, are you saying that the Lambda function's configured timeout is exceeded? What is it set to? And are you trying to read the entire CSV into memory before batching writes to DynamoDB, or effectively streaming in the CSV? – jarmod
If you're hitting timeouts, could you break it down into smaller chunks and import them all separately? – Chris Williams

1 Answer


I really appreciate the suggestions (Chris and jarmod). After trying and failing to break things up programmatically into smaller chunks, I decided to rethink the approach as a whole.

Through research, I found there were four options:

  1. Lambda function - as above, this fails with a timeout.
  2. AWS Data Pipeline - doesn't have a template for importing a CSV into DynamoDB.
  3. Manual entry - of 2.5m items? No thanks! :)
  4. Use an EC2 instance to load the data into RDS, then use DMS to migrate it to DynamoDB.

The last option actually worked well. Here's what I did:

  • Created an RDS MySQL database (I used a db.t2.micro instance, as it's free-tier eligible) and created a blank table in it.
  • Created an EC2 instance (free-tier Linux). Then, on the EC2 instance:
  • Used scp to upload the CSV file to the instance.
  • Ran sudo yum install mysql to get the client tools, then used mysqlimport with the --local option to import the CSV file into the RDS MySQL database, which took literally seconds to complete (the commands are sketched after this list).
  • At this point I also did some data cleansing to remove some whitespace and some carriage returns that had crept into the file, just using standard SQL queries.
  • Using DMS, I created a replication instance, endpoints for the source (RDS) and target (DynamoDB) databases, and finally a task to run the import (a rough CLI equivalent is also sketched below).
  • The import took around 4 hours 30 minutes.
  • After the import, I removed the EC2, RDS, and DMS resources (and the associated IAM roles) to avoid any ongoing costs.
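
For reference, the EC2 side looked roughly like this (the key file, host names and database/table/column names are placeholders, and the exact mysqlimport flags will depend on your CSV format):

    # Copy the CSV from my local machine up to the EC2 instance
    scp -i my-key.pem data.csv ec2-user@<ec2-public-dns>:~/

    # On the EC2 instance: install the MySQL client tools
    sudo yum install -y mysql

    # Import the CSV into the RDS database; mysqlimport loads the file into a
    # table named after the file (data.csv -> table `data`), and --local reads
    # the file from the EC2 instance rather than from the database server
    mysqlimport --local \
      --host=<rds-endpoint> --user=<user> -p \
      --fields-terminated-by=',' --ignore-lines=1 \
      mydatabase /home/ec2-user/data.csv

    # Example cleanup query for stray whitespace and carriage returns
    mysql --host=<rds-endpoint> --user=<user> -p mydatabase \
      -e "UPDATE data SET some_column = TRIM(REPLACE(some_column, '\r', ''))"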

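I did the DMS setup through the console, but for completeness the equivalent AWS CLI calls would look roughly like this (identifiers, connection details and ARNs are placeholders, and I haven't verified these exact commands end to end):

    # Replication instance (smallest class)
    aws dms create-replication-instance \
      --replication-instance-identifier csv-import \
      --replication-instance-class dms.t2.micro

    # Source endpoint pointing at the RDS MySQL database
    aws dms create-endpoint \
      --endpoint-identifier rds-source --endpoint-type source --engine-name mysql \
      --server-name <rds-endpoint> --port 3306 \
      --username <user> --password <password> --database-name mydatabase

    # Target endpoint pointing at DynamoDB (needs an IAM role DMS can assume)
    aws dms create-endpoint \
      --endpoint-identifier ddb-target --endpoint-type target --engine-name dynamodb \
      --dynamo-db-settings ServiceAccessRoleArn=<dms-dynamodb-role-arn>

    # Full-load task tying it together; table-mappings.json selects the source table
    aws dms create-replication-task \
      --replication-task-identifier csv-import-task \
      --source-endpoint-arn <source-endpoint-arn> \
      --target-endpoint-arn <target-endpoint-arn> \
      --replication-instance-arn <replication-instance-arn> \
      --migration-type full-load \
      --table-mappings file://table-mappings.json
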
Fortunately, I had a flat structure to work with, and it was only one table. I needed the cheap speed of DynamoDB; otherwise, I'd have stuck with RDS (I almost did, halfway through the process!).

Thanks for reading, and best of luck if you have the same issue in the future.