2 votes

I exported a DynamoDB table using an AWS Data Pipeline with DataNodes > S3BackupLocation > Compression set to GZIP. I expected compressed output with a .gz extension, but I got uncompressed output with no extension.

Further reading reveals that the compression field "is only supported for use with Amazon Redshift and when you use S3DataNode with CopyActivity."

How can I get a gzipped backup of my DynamoDB table into S3? Do I have to resort to downloading all the files, gzipping them, and uploading them? Is there a way to make the pipeline work with CopyActivity? Is there a better approach?

I've been experimenting with using Hive for the export, but I haven't yet found a way to get the formatting right on the output. It needs to match the format below so EMR jobs can read it alongside data from another source.

{"col1":{"n":"596487.0550532"},"col2":{"s":"xxxx-xxxx-xxxx"},"col3":{"s":"xxxx-xxxx-xxxx"}}
{"col1":{"n":"234573.7390354"},"col2":{"s":"xxxx-xxxx-xxxx"},"col3":{"s":"xxxx-xxxxx-xx"}}
{"col2":{"s":"xxxx-xxxx-xxxx"},"col1":{"n":"6765424.7390354"},"col3":{"s":"xxxx-xxxxx-xx"}}

1 Answer

2 votes

I too have been looking for how to do this. It is such a basic request that I'm surprised that it's not part of a base data pipeline workflow.

After days of investigation and experimentation, I've found 2 mechanisms:

1) Use ShellCommandActivity to run a few AWS CLI commands (s3 cp, gzip) that download from S3, gzip locally, and re-upload to S3. Here are the relevant parts (a standalone sketch of the same steps follows the snippet):

{
    "name": "CliActivity",
    "id": "CliActivity",
    "runsOn": {
        "ref": "Ec2Instance"
    },
    "type": "ShellCommandActivity",
    "command": "(sudo yum -y update aws-cli) && (#{myCopyS3ToLocal}) && (#{myGzip}) && (#{myCopyLocalToS3})"
},

"values": {
    "myCopyS3ToLocal": "aws s3 cp s3://your-bucket/your-folders/ --recursive",
    "myGzip": "for file in /tmp/random-date/*; do gzip \"$file\"; done",
    "myCopyLocalToS3": "aws s3 cp /tmp/random-date s3://your-bucket/your-folders-gz/ --recursive"
}
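For clarity, here is a rough standalone sketch of the same three steps (download, gzip, re-upload) using boto3 instead of the CLI; the bucket, prefixes, and /tmp/random-date scratch directory are the same placeholders used above:

import gzip
import os
import shutil

import boto3

s3 = boto3.client("s3")
bucket = "your-bucket"           # placeholder, as in the pipeline above
src_prefix = "your-folders/"     # placeholder source prefix
dst_prefix = "your-folders-gz/"  # placeholder destination prefix
workdir = "/tmp/random-date"     # local scratch directory
os.makedirs(workdir, exist_ok=True)

paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=bucket, Prefix=src_prefix):
    for obj in page.get("Contents", []):
        key = obj["Key"]
        name = os.path.basename(key)
        if not name:
            continue  # skip the "directory" placeholder object, if any

        # 1) download the object locally (the aws s3 cp --recursive step)
        local_path = os.path.join(workdir, name)
        s3.download_file(bucket, key, local_path)

        # 2) gzip the local copy (the "for file in ...; do gzip ..." step)
        with open(local_path, "rb") as f_in, gzip.open(local_path + ".gz", "wb") as f_out:
            shutil.copyfileobj(f_in, f_out)

        # 3) upload the compressed copy under the destination prefix
        s3.upload_file(local_path + ".gz", bucket, dst_prefix + name + ".gz")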

2) Create a separate EMR cluster, then create a data pipeline that uses that cluster to run S3DistCp (s3-dist-cp); a boto3 version of the step submission follows the snippet below.

{
    "name": "CliActivity",
    "id": "CliActivity",
    "runsOn": {
        "ref": "Ec2Instance"
    },
    "type": "ShellCommandActivity",
    "command": "(sudo yum -y update aws-cli) && (#{myAWSCLICmd})"
},

"values": {
    "myAWSCLICmd": "aws emr add-steps --cluster-id j-XXXXYYYYZZZZ --region us-east-1 --steps Name=\"S3DistCp command runner\",Jar=\"command-runner.jar\",Args=[\"s3-dist-cp\",\"--s3Endpoint=s3.amazonaws.com\",\"--src=s3://your-bucket/your-folders/\",\"--dest=s3://your-bucket/your-folders-gz/\",\"--outputCodec=gz\"]"
}
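If you would rather submit the S3DistCp step programmatically instead of shelling out to the CLI, a rough boto3 equivalent of that add-steps call looks like this (the cluster id, bucket, and prefixes are the same placeholders as above):

import boto3

emr = boto3.client("emr", region_name="us-east-1")

# Same command-runner.jar / s3-dist-cp step as the CLI call above,
# submitted through the EMR API instead of `aws emr add-steps`.
emr.add_job_flow_steps(
    JobFlowId="j-XXXXYYYYZZZZ",   # placeholder cluster id
    Steps=[
        {
            "Name": "S3DistCp command runner",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": [
                    "s3-dist-cp",
                    "--s3Endpoint=s3.amazonaws.com",
                    "--src=s3://your-bucket/your-folders/",
                    "--dest=s3://your-bucket/your-folders-gz/",
                    "--outputCodec=gz",
                ],
            },
        }
    ],
)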

Of the two, I prefer the second because s3-dist-cp can automatically delete the source S3 files. However, it requires a separate EMR cluster to run (higher cost). Alternatively, you can add an additional step to #1 to do the deletion (see the sketch below).
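If you go with #1 and want that deletion step, a minimal sketch (assuming the same bucket and source prefix placeholders) that removes the original uncompressed objects once the gzipped copies are uploaded:

import boto3

s3 = boto3.client("s3")
bucket = "your-bucket"        # placeholder, as above
src_prefix = "your-folders/"  # placeholder source prefix

# Delete every object under the source prefix after the .gz copies exist.
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=bucket, Prefix=src_prefix):
    keys = [{"Key": obj["Key"]} for obj in page.get("Contents", [])]
    if keys:
        s3.delete_objects(Bucket=bucket, Delete={"Objects": keys})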

Also, if you want to parameterize the paths, you may need to inline the values directly in the command so that you can take advantage of expressions like #{format(@scheduledStartTime,'YYYY-MM-dd_hh.mm')}.