
I have a data pipeline that runs every hour, using a HiveCopyActivity to select the past hour's data from DynamoDB into S3. The table I'm selecting from has hash key VisitorID and range key Timestamp, around 4 million rows, and is 7.5 GB in size. To reduce the job's runtime, I created a global secondary index on Timestamp, but after monitoring CloudWatch it seems that HiveCopyActivity doesn't use the index. I've read through all the relevant AWS documentation but can't find any mention of indexes.

Is there a way to force Data Pipeline to use an index while filtering like this? If not, are there any alternative tools that could transfer hourly (or any other period) data from DynamoDB to S3?


1 Answer


Unfortunately, the DynamoDB EMR Hive adapter doesn't currently support reading from indexes. You would need to write your own sweeper that scans the index and writes the results to S3 - see https://github.com/awslabs/dynamodb-import-export-tool for the basics of implementing such an import/export pipeline. That library is essentially a parallel scan framework for sweeping DDB tables.
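
As a rough illustration, here's a minimal sketch of such a sweeper in Python with boto3. It assumes a numeric epoch-second Timestamp attribute, a GSI named Timestamp-index, a table named Visits, and a bucket named my-export-bucket (all of these names are hypothetical - substitute your own). It runs a segmented parallel scan of the index, filters to the past hour, and writes the matching items to S3 as JSON:

```python
import json
import time
from concurrent.futures import ThreadPoolExecutor

import boto3
from boto3.dynamodb.conditions import Attr

dynamodb = boto3.resource("dynamodb")
s3 = boto3.client("s3")
table = dynamodb.Table("Visits")  # hypothetical table name
TOTAL_SEGMENTS = 4                # degree of scan parallelism

def sweep_segment(segment, start_ts, end_ts):
    """Scan one segment of the GSI for items with start_ts <= Timestamp < end_ts."""
    items = []
    kwargs = {
        "IndexName": "Timestamp-index",  # hypothetical GSI name
        # between() is inclusive on both ends, hence end_ts - 1
        "FilterExpression": Attr("Timestamp").between(start_ts, end_ts - 1),
        "Segment": segment,
        "TotalSegments": TOTAL_SEGMENTS,
    }
    while True:
        page = table.scan(**kwargs)
        items.extend(page["Items"])
        if "LastEvaluatedKey" not in page:
            return items
        # Paginate: continue the scan from where the last page stopped.
        kwargs["ExclusiveStartKey"] = page["LastEvaluatedKey"]

def export_last_hour():
    end_ts = int(time.time())
    start_ts = end_ts - 3600
    # Run all scan segments in parallel, one thread per segment.
    with ThreadPoolExecutor(max_workers=TOTAL_SEGMENTS) as pool:
        results = pool.map(
            lambda seg: sweep_segment(seg, start_ts, end_ts),
            range(TOTAL_SEGMENTS),
        )
    rows = [item for chunk in results for item in chunk]
    s3.put_object(
        Bucket="my-export-bucket",  # hypothetical bucket name
        Key=f"visits/{start_ts}-{end_ts}.json",
        # default=str handles the Decimal values boto3 returns for numbers
        Body=json.dumps(rows, default=str),
    )

if __name__ == "__main__":
    export_last_hour()
```

Note that a scan with a FilterExpression still reads the whole index and consumes read capacity accordingly, so this mainly pays off when the GSI projects only a small subset of attributes and is therefore much smaller than the base table. The import/export tool linked above applies the same segmented-scan idea at larger scale.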