2
votes

What is the best and most cost-saving way to deduplicate events written from Firehose to S3?

My scenario: I have multiple sources that write their events as JSON to a Kinesis Firehose stream. The stream writes the events to an S3 bucket. Then the events should be analyzed with Athena.

Because Firehose does not guarantee that there will be no duplicates, I somehow have to deduplicate the data. I also have to partition it somehow for Athena.

The ways I came up with until now are:

  • Use an EMR cluster (once a day, for example) to do the deduplication and partitioning. But this is cost intensive, and to stay cost efficient it is not practical to run it more often than once a day.
  • Use a scheduled Lambda function that deduplicates a sliding time window, and another Lambda that partitions the data. Costs: I don't know, because I have never used Lambda before.

Is there a better, more elegant, and more cost-effective way?


1 Answer

2
votes

First of all, I think you should weigh how much it is worth weeding out duplicates against how often Firehose will actually deliver duplicates. It's exceedingly rare that you end up with duplicates because of Firehose itself, but if you have producers that could also send duplicates to your Firehose, you still might want to handle it, of course.

The method you should use depends on your use case, and if you gave some more details about it, it might be easier to give you a more definitive answer.

If you don't have a lot of data you can pay the price on the read side instead of, for example, rewriting the data. SELECT DISTINCT * FROM table should remove duplicate rows. If your queries contain aggregations you can do SELECT column, COUNT(*) FROM (SELECT DISTINCT * FROM table) GROUP BY 1 – or some variant of SELECT foo, MIN(bar), MIN(baz) FROM table GROUP BY 1 if you have a column that should be unique. Since Athena charges for scanned data and not compute resources this will not cost extra, but it will of course be slower.
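
If you go that route, one way to avoid repeating the DISTINCT in every query is to put it behind a view – a minimal sketch, assuming your raw table is called events and has a hypothetical event_type column:

    -- Wrap the deduplication in a view so downstream queries don't have
    -- to repeat the DISTINCT step (table and column names are examples).
    CREATE OR REPLACE VIEW events_dedup AS
    SELECT DISTINCT *
    FROM events;

    -- Aggregations then run against the view instead of the raw table.
    SELECT event_type, COUNT(*) AS event_count
    FROM events_dedup
    GROUP BY event_type;

The view doesn't change what Athena scans, so the cost stays the same – it just keeps the deduplication logic in one place.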

I wouldn't recommend that solution if you have a lot of data, and in that case I think you need an extra step in your pipeline anyway, because you also shouldn't query the data produced by Firehose as-is. You need to create a partitioned table and add each hour, day, or month as its own partition (depending on exactly how much data we're talking about). You can do that without moving the data, but since you have to have an extra step anyway you could just as well stick the deduplication there too – and if you do, look at using Glue ETL; it might be less overhead for you than EMR.
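
As a rough sketch of what the partitioned table could look like – the bucket name, column names, and daily granularity below are assumptions, and the partition locations simply point at the YYYY/MM/DD/HH prefixes Firehose writes by default:

    -- Partitioned table over the Firehose output (bucket, columns, and
    -- the daily granularity are assumptions; adjust to your layout).
    CREATE EXTERNAL TABLE events_partitioned (
      event_type string,
      payload    string
    )
    PARTITIONED BY (dt string)
    ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
    LOCATION 's3://my-bucket/events/';

    -- Register each day as its own partition, pointing at the prefix
    -- Firehose already wrote to – no data has to be moved or rewritten.
    ALTER TABLE events_partitioned
    ADD IF NOT EXISTS PARTITION (dt = '2019-01-15')
    LOCATION 's3://my-bucket/events/2019/01/15/';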