4
votes

I have a DynamoDB table with 1.5 million records (about 2 GB). How do I export this to S3?

The AWS Data Pipeline method worked for a small table, but I am running into issues exporting the 1.5 million record table to my S3 bucket.

On my initial attempt, the pipeline job ran for 1 hour and then failed with:

java.lang.OutOfMemoryError: GC overhead limit exceeded

I increased the NameNode heap size by supplying a hadoop-env configuration object to the instances in the EMR cluster, following this link.
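For reference, this is roughly what that configuration looks like. A minimal sketch, assuming you launch the export cluster yourself with boto3 rather than letting Data Pipeline provision it; the heap sizes (in MB), release label, cluster name, and region are placeholder assumptions:

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")  # assumed region

# EMR "hadoop-env" classification; HADOOP_NAMENODE_HEAPSIZE /
# HADOOP_DATANODE_HEAPSIZE are the EMR daemon heap variables (MB).
configurations = [
    {
        "Classification": "hadoop-env",
        "Properties": {},
        "Configurations": [
            {
                "Classification": "export",
                "Properties": {
                    "HADOOP_NAMENODE_HEAPSIZE": "8192",
                    "HADOOP_DATANODE_HEAPSIZE": "4096",
                },
            }
        ],
    }
]

cluster = emr.run_job_flow(
    Name="ddb-export",                 # hypothetical cluster name
    ReleaseLabel="emr-5.36.0",         # assumed release label
    Applications=[{"Name": "Hadoop"}, {"Name": "Hive"}],
    Instances={
        "MasterInstanceType": "m3.2xlarge",
        "SlaveInstanceType": "m3.2xlarge",
        "InstanceCount": 2,
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    Configurations=configurations,
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print(cluster["JobFlowId"])
```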

After increasing the heap size, my next run also failed after 1 hour, this time with a different error, as seen in the attached screenshots. I am not sure what else to do to fix this completely.


Also, while checking the AWS CloudWatch graphs for the instances in the EMR cluster, I noticed the core node was continuously at 100% CPU usage.

The EMR cluster instance types (master and core node) were m3.2xlarge.

1
This might be a long shot, but does it work on newer instance types such as m5? The m3s are legacy. – Chris Williams
You can define a Hive table using the DynamoDB EMR connector and run a Spark job that imports the data from DynamoDB and exports it to S3. – Abdelrahman Maharek
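For anyone trying the approach from the last comment, here is a minimal sketch of a Hive-based export submitted as an EMR step with boto3. The cluster id, bucket, DynamoDB table name, column mapping, and read-throughput percentage are all placeholder assumptions:

```python
import boto3

REGION = "us-east-1"                                  # assumed region
CLUSTER_ID = "j-XXXXXXXXXXXX"                         # placeholder EMR cluster id
SCRIPT_S3 = "s3://my-bucket/scripts/ddb_export.q"     # placeholder script location

# HiveQL that maps the DynamoDB table through the EMR DynamoDB connector
# and writes its rows out to S3. Table, columns, and mapping are placeholders.
hive_script = """
CREATE EXTERNAL TABLE ddb_table (id string, payload string)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
TBLPROPERTIES (
  'dynamodb.table.name' = 'MyDynamoDBTable',
  'dynamodb.column.mapping' = 'id:id,payload:payload',
  'dynamodb.throughput.read.percent' = '0.5'
);

INSERT OVERWRITE DIRECTORY 's3://my-bucket/ddb-export/'
SELECT * FROM ddb_table;
"""

# Upload the script, then run it as an EMR step via command-runner.jar.
boto3.client("s3", region_name=REGION).put_object(
    Bucket="my-bucket", Key="scripts/ddb_export.q", Body=hive_script
)

boto3.client("emr", region_name=REGION).add_job_flow_steps(
    JobFlowId=CLUSTER_ID,
    Steps=[{
        "Name": "export-dynamodb-to-s3",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["hive-script", "--run-hive-script",
                     "--args", "-f", SCRIPT_S3],
        },
    }],
)
```

Keeping dynamodb.throughput.read.percent below 1.0 also throttles the scan so the export does not consume all of the table's read capacity.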

1 Answer

3
votes

The issue was that the map tasks were not running efficiently; the core node was hitting 100% CPU usage. I upgraded the cluster instance types to one of the available compute-optimized C-series types, and the export completed with no issues.
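For reference, a sketch of how such an instance type change could be applied to an existing Data Pipeline definition with boto3, assuming the instance types are set directly on the pipeline's EmrCluster object rather than through pipeline parameters; the pipeline id and the c4.2xlarge type are placeholders:

```python
import boto3

dp = boto3.client("datapipeline", region_name="us-east-1")  # assumed region
PIPELINE_ID = "df-XXXXXXXXXXXX"                             # placeholder pipeline id

# Fetch the current definition, swap the EmrCluster instance types to a
# compute-optimized type, and push the full definition back.
definition = dp.get_pipeline_definition(pipelineId=PIPELINE_ID)

for obj in definition["pipelineObjects"]:
    for field in obj["fields"]:
        if field["key"] in ("masterInstanceType", "coreInstanceType"):
            field["stringValue"] = "c4.2xlarge"             # assumed C-series type

dp.put_pipeline_definition(
    pipelineId=PIPELINE_ID,
    pipelineObjects=definition["pipelineObjects"],
    parameterObjects=definition.get("parameterObjects", []),
    parameterValues=definition.get("parameterValues", []),
)
dp.activate_pipeline(pipelineId=PIPELINE_ID)
```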