What is the best and most cost-saving way to deduplicate events written from Firehose to S3?
My scenario: I have multiple sources that write their events as JSON to a Kinesis Firehose stream. The stream delivers the events to an S3 bucket. The events should then be analyzed with Athena.
Because Firehose does not guarantee that the delivered data is free of duplicates, I somehow have to deduplicate the data. I also have to partition it for Athena.
The approaches I have come up with so far are:
- Use an EMR cluster (for example once a day) to do the deduplication and partitioning. But this is cost-intensive, and to stay cost-efficient it cannot really run more often than once a day.
- Use a scheduled Lambda function that deduplicates a sliding time window, plus another Lambda that partitions the data (a rough sketch of this idea is below). Costs: I don't know, because I have never used Lambda before.
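
To make the second option more concrete, here is a minimal sketch of what such a scheduled deduplication Lambda could look like. It assumes each producer attaches a unique `event_id` to every event, that Firehose delivers newline-delimited JSON under its default `YYYY/MM/DD/HH/` prefix, and the bucket names are placeholders:

```python
# Hypothetical sketch: hourly scheduled Lambda that deduplicates the objects
# Firehose delivered in the current hour and writes them back out under an
# Athena-friendly partition layout (dt=YYYY-MM-DD/hour=HH/).
import json
from datetime import datetime, timezone

import boto3

s3 = boto3.client("s3")

RAW_BUCKET = "my-firehose-raw-bucket"    # placeholder: where Firehose delivers
CLEAN_BUCKET = "my-athena-clean-bucket"  # placeholder: deduplicated, partitioned copy


def lambda_handler(event, context):
    now = datetime.now(timezone.utc)
    # Firehose's default prefix is UTC-based YYYY/MM/DD/HH/, so one hourly run
    # only has to scan a single prefix (the "sliding time window").
    prefix = now.strftime("%Y/%m/%d/%H/")

    seen_ids = set()
    deduped_lines = []

    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=RAW_BUCKET, Prefix=prefix):
        for obj in page.get("Contents", []):
            body = s3.get_object(Bucket=RAW_BUCKET, Key=obj["Key"])["Body"].read()
            # Assumes each record is one JSON object per line and carries a
            # producer-generated unique "event_id" field.
            for line in body.decode("utf-8").splitlines():
                if not line.strip():
                    continue
                record = json.loads(line)
                if record["event_id"] in seen_ids:
                    continue
                seen_ids.add(record["event_id"])
                deduped_lines.append(line)

    if deduped_lines:
        out_key = now.strftime("dt=%Y-%m-%d/hour=%H/events.json")
        s3.put_object(
            Bucket=CLEAN_BUCKET,
            Key=out_key,
            Body=("\n".join(deduped_lines) + "\n").encode("utf-8"),
        )
```

One obvious limitation of this sketch is that it only removes duplicates that land inside the same hourly prefix; duplicates that cross the window boundary would survive, so I am not sure whether this is good enough in practice.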
Is there a better, more elegant, and more cost-saving way?