I have a DynamoDB table that stores metadata for items in S3 (e.g. images and files). Sometimes the S3 objects get deleted but the metadata is not, so I run a process that scans the entire DynamoDB table and checks whether each S3 object still exists; if not, it deletes the DynamoDB row. But as the total number of objects grew, the scan started taking longer and longer to finish. I want to guarantee that everything in DynamoDB gets scanned every day, no matter how big the table is. So I'm looking for design suggestions for rewriting the scan tool so that it scales horizontally with the table.
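The check-and-delete logic looks roughly like the sketch below. This is a hedged, minimal example: `object_exists` stands in for an S3 `HeadObject` call (e.g. boto3's `head_object`, catching the 404 `ClientError`), `delete_row` stands in for a DynamoDB `DeleteItem`, and the `s3_key` attribute name is hypothetical. Both operations are injected as callables so the reconciliation logic itself runs standalone:

```python
from typing import Callable, Iterable

def reconcile(items: Iterable[dict],
              object_exists: Callable[[str], bool],
              delete_row: Callable[[str], None]) -> int:
    """Delete metadata rows whose backing S3 object is gone.

    Returns the number of rows deleted.
    """
    deleted = 0
    for item in items:
        key = item["s3_key"]  # hypothetical attribute name
        if not object_exists(key):
            delete_row(key)   # in practice: table.delete_item(Key=...)
            deleted += 1
    return deleted

# Demo with in-memory fakes standing in for S3 and DynamoDB:
existing = {"a.jpg", "b.png"}
rows = [{"s3_key": k} for k in ("a.jpg", "gone.txt", "b.png")]
removed = []
count = reconcile(rows, existing.__contains__, removed.append)
```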
Currently, I'm using the parallel scan feature provided by DynamoDB to split the table into 1,000 segments, and I add more threads so that more segments are scanned at the same time; by increasing parallelism, the entire scan finishes sooner. The process is scheduled to run at midnight. But I can see this approach failing once the table grows past the point where scanning a single segment takes longer than one day.
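For context, the fan-out pattern I use is roughly the following. This is a sketch, not my production code: the real `scan_segment` would call boto3's `table.scan(Segment=..., TotalSegments=...)` in a loop on `LastEvaluatedKey`; here it is stubbed with fake items so the threading structure can be shown on its own, and the segment count is reduced from 1,000 for illustration:

```python
from concurrent.futures import ThreadPoolExecutor

TOTAL_SEGMENTS = 8  # illustrative; the real job uses 1,000

def scan_segment(segment: int, total: int) -> list:
    """Stub for a paginated DynamoDB scan of one segment, i.e.
    table.scan(Segment=segment, TotalSegments=total) repeated with
    ExclusiveStartKey until LastEvaluatedKey is absent."""
    return [f"item-{segment}-{i}" for i in range(3)]  # fake items

def scan_all(total: int = TOTAL_SEGMENTS) -> list:
    # One thread per segment; each worker scans its segment independently.
    with ThreadPoolExecutor(max_workers=total) as pool:
        futures = [pool.submit(scan_segment, s, total) for s in range(total)]
        items = []
        for f in futures:
            items.extend(f.result())
        return items

items = scan_all()
```

The weakness is that the segment count is fixed up front, so the wall-clock time of the slowest segment bounds the whole run.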