I have a DynamoDB table that stores metadata for items in S3 (e.g. images and files). Sometimes the S3 objects get deleted but the metadata is not, so I run a process that scans the entire DynamoDB table and checks whether each S3 object still exists; if not, it deletes the DynamoDB row. But as the total number of objects grew, the scan started taking longer and longer to finish. I want to guarantee that everything in DynamoDB gets scanned every day, no matter how big the table is. So I'm looking for design suggestions for rewriting the scan tool so that it scales horizontally with the table.
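The check-and-delete logic looks roughly like the sketch below. This is a hedged, minimal example: `object_exists` stands in for an S3 `HeadObject` call (e.g. boto3's `head_object`, catching the 404 `ClientError`), `delete_row` stands in for a DynamoDB `DeleteItem`, and the `s3_key` attribute name is hypothetical. Both operations are injected as callables so the reconciliation logic itself runs standalone:

```python
from typing import Callable, Iterable

def reconcile(items: Iterable[dict],
              object_exists: Callable[[str], bool],
              delete_row: Callable[[str], None]) -> int:
    """Delete metadata rows whose backing S3 object is gone.

    Returns the number of rows deleted.
    """
    deleted = 0
    for item in items:
        key = item["s3_key"]  # hypothetical attribute name
        if not object_exists(key):
            delete_row(key)   # in practice: table.delete_item(Key=...)
            deleted += 1
    return deleted

# Demo with in-memory fakes standing in for S3 and DynamoDB:
existing = {"a.jpg", "b.png"}
rows = [{"s3_key": k} for k in ("a.jpg", "gone.txt", "b.png")]
removed = []
count = reconcile(rows, existing.__contains__, removed.append)
```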
Currently, I'm using the parallel scan feature provided by DynamoDB to split the table into 1,000 segments, and I add more threads so that more segments are scanned at the same time; by increasing parallelism, the entire scan finishes sooner. The process is scheduled to run at midnight. But I can see this approach failing once the table grows past the point where scanning a single segment takes longer than one day.
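For context, the fan-out pattern I use is roughly the following. This is a sketch, not my production code: the real `scan_segment` would call boto3's `table.scan(Segment=..., TotalSegments=...)` in a loop on `LastEvaluatedKey`; here it is stubbed with fake items so the threading structure can be shown on its own, and the segment count is reduced from 1,000 for illustration:

```python
from concurrent.futures import ThreadPoolExecutor

TOTAL_SEGMENTS = 8  # illustrative; the real job uses 1,000

def scan_segment(segment: int, total: int) -> list:
    """Stub for a paginated DynamoDB scan of one segment, i.e.
    table.scan(Segment=segment, TotalSegments=total) repeated with
    ExclusiveStartKey until LastEvaluatedKey is absent."""
    return [f"item-{segment}-{i}" for i in range(3)]  # fake items

def scan_all(total: int = TOTAL_SEGMENTS) -> list:
    # One thread per segment; each worker scans its segment independently.
    with ThreadPoolExecutor(max_workers=total) as pool:
        futures = [pool.submit(scan_segment, s, total) for s in range(total)]
        items = []
        for f in futures:
            items.extend(f.result())
        return items

items = scan_all()
```

The weakness is that the segment count is fixed up front, so the wall-clock time of the slowest segment bounds the whole run.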