I exported a DynamoDB table using an AWS Data Pipeline with DataNodes > S3BackupLocation > Compression set to GZIP. I expected compressed output with a .gz extension, but I got uncompressed output with no extension.
Further reading reveals that the compression field "is only supported for use with Amazon Redshift and when you use S3DataNode with CopyActivity."
How can I get a gzipped backup of my DynamoDB table into S3? Do I have to resort to downloading all the files, gzipping them, and uploading them? Is there a way to make the pipeline work with CopyActivity? Is there a better approach?
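If the manual fallback turns out to be the only option, the gzip step itself is small. Here is a minimal sketch of it, assuming the export is a set of plain files; the bucket and key names are placeholders, and the S3 transfer calls are shown only in comments so the compression helper stands alone:

```python
import gzip
import shutil

def gzip_file(src_path: str, dst_path: str) -> None:
    """Compress a single exported file with gzip, streaming to keep memory flat."""
    with open(src_path, "rb") as src, gzip.open(dst_path, "wb") as dst:
        shutil.copyfileobj(src, dst)

# The surrounding S3 transfer would look roughly like this (placeholder
# bucket/key names, requires boto3 and AWS credentials):
#   import boto3
#   s3 = boto3.client("s3")
#   s3.download_file("my-backup-bucket", "export/part-0000", "/tmp/part-0000")
#   gzip_file("/tmp/part-0000", "/tmp/part-0000.gz")
#   s3.upload_file("/tmp/part-0000.gz", "my-backup-bucket", "export/part-0000.gz")
```

That works, but it pulls every byte out of S3 and pushes it back, which is what I was hoping the pipeline's Compression setting would avoid.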
I've been experimenting with using Hive for the export, but I haven't yet found a way to get the formatting right on the output. It needs to match the format below so EMR jobs can read it alongside data from another source.
{"col1":{"n":"596487.0550532"},"col2":{"s":"xxxx-xxxx-xxxx"},"col3":{"s":"xxxx-xxxx-xxxx"}}
{"col1":{"n":"234573.7390354"},"col2":{"s":"xxxx-xxxx-xxxx"},"col3":{"s":"xxxx-xxxxx-xx"}}
{"col2":{"s":"xxxx-xxxx-xxxx"},"col1":{"n":"6765424.7390354"},"col3":{"s":"xxxx-xxxxx-xx"}}
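To be precise about what the downstream EMR jobs expect: each line is an independent JSON object in DynamoDB's typed-attribute encoding, where every column maps to a single-key object whose key is the type ("n" for number, "s" for string) and whose value is always a string. Attribute order varies line to line (as in the third sample), which is fine for a JSON consumer. A reader for one line would look roughly like this (illustrative sketch, not part of my pipeline):

```python
import json

def parse_line(line: str) -> dict:
    """Decode one DynamoDB-typed JSON line into plain Python values."""
    item = json.loads(line)
    out = {}
    for col, typed in item.items():
        # Each attribute is a single-key dict: {"n": "..."} or {"s": "..."}.
        (dtype, raw), = typed.items()
        out[col] = float(raw) if dtype == "n" else raw
    return out
```

So whatever does the export (Hive or otherwise) has to emit exactly this shape, one object per line, with numbers kept as quoted strings under "n".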