
I want to deduplicate records in my Kafka Streams application, which uses a state store, following this very good example:

https://github.com/confluentinc/kafka-streams-examples/blob/5.5.0-post/src/test/java/io/confluent/examples/streams/EventDeduplicationLambdaIntegrationTest.java

I have a few questions about this example.

If I understand correctly, this example briefly does the following:

  • A message arrives on the input topic.
  • Look it up in the state store; if it is not there, write it to the state store and forward it.
  • If it is there, drop the record, so deduplication is applied.

But in the code example there is a time window size that you can configure, as well as a retention time for the messages in the state store. You can also check whether a record is in the store by passing the timestamps timeFrom and timeTo:

        final long eventTime = context.timestamp();

        final WindowStoreIterator<String> timeIterator = store.fetch(
            key,
            eventTime - leftDurationMs,
            eventTime + rightDurationMs
        );
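To make the lookup concrete, here is a minimal plain-Java sketch of the duplicate check, where a `TreeMap` keyed by event-time stands in for the `WindowStore` (this is not the real Kafka Streams API; the names `leftDurationMs`/`rightDurationMs` follow the linked example, everything else is a simplification for a single key):

```java
import java.util.NavigableMap;
import java.util.TreeMap;

// Simplified stand-in for the windowed dedup check: one key, a TreeMap
// of eventTime -> eventId instead of a real WindowStore.
public class DedupSketch {
    private final NavigableMap<Long, String> storeForKeyA = new TreeMap<>();
    private final long leftDurationMs = 60_000;   // how far back to search
    private final long rightDurationMs = 60_000;  // how far ahead to search

    /** True if an event was seen in [eventTime - left, eventTime + right]. */
    public boolean isDuplicate(long eventTime) {
        // analogous to store.fetch(key, eventTime - leftDurationMs,
        //                               eventTime + rightDurationMs)
        return !storeForKeyA
            .subMap(eventTime - leftDurationMs, true,
                    eventTime + rightDurationMs, true)
            .isEmpty();
    }

    public void remember(long eventTime, String eventId) {
        storeForKeyA.put(eventTime, eventId);
    }

    public static void main(String[] args) {
        DedupSketch s = new DedupSketch();
        long t0 = 1_000_000L;
        System.out.println(s.isDuplicate(t0));           // false: first sight
        s.remember(t0, "A");
        System.out.println(s.isDuplicate(t0 + 30_000));  // true: inside window
        System.out.println(s.isDuplicate(t0 + 120_000)); // false: outside window
    }
}
```

So the first occurrence is remembered, a second occurrence within the fetch range is dropped, and an occurrence far enough away is treated as new.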

What is the actual purpose of timeFrom and timeTo? I am not sure why I would check the next time interval, since that seems to cover future messages that have not arrived on my topic yet.

My second question: is this time interval related to the window size, and should the lookup HIT the previous time window?

If I can search any time interval by passing timeFrom and timeTo, why is the time window size important?

If I set the window size to 12 hours, am I guaranteed that messages are deduplicated for 12 hours?

I think of it like this:

The first message with key "A" arrives in the first minute after the application starts up; 11 hours later, a message with key "A" arrives again. Can I catch this duplicate by passing a large enough time interval, like eventTime - 12 hours?

Thanks for any ideas!


1 Answer


The time window size decides how long you want the deduplication to apply: should duplicates be suppressed forever, or only within, say, 5 minutes? Kafka has to store these records, so a large time window may consume a lot of resources on your server.

timeFrom and timeTo exist because your record (event) may arrive or be processed late in Kafka, so the event-time of the record may be 1 minute ago, not now. Kafka is then processing an "old" record, and that is why it needs to take into account records that are not that old, i.e. records that are "future" relative to the "old" one.
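A small stand-alone sketch of that late-arrival case may help. Again a `TreeMap` stands in for the `WindowStore`, and the timestamps are made up for illustration, assuming a symmetric 60-second search range on each side:

```java
import java.util.NavigableMap;
import java.util.TreeMap;

// Why the fetch range extends *forward* from the record's event-time:
// a late, out-of-order copy has an OLDER event-time than the copy
// that is already in the store.
public class LateArrivalSketch {
    public static void main(String[] args) {
        NavigableMap<Long, String> seenForKeyA = new TreeMap<>();
        long left = 60_000, right = 60_000;

        // An on-time copy of event "A" is processed first, event-time 100_000.
        seenForKeyA.put(100_000L, "A");

        // A late copy of the same event arrives afterwards with an
        // older event-time, 70_000.
        long lateEventTime = 70_000L;

        // Searching only backwards, [10_000, 70_000], misses the copy
        // stored at 100_000:
        boolean backwardOnly = !seenForKeyA
            .subMap(lateEventTime - left, true, lateEventTime, true)
            .isEmpty();

        // Searching both directions, [10_000, 130_000], finds it:
        boolean bothDirections = !seenForKeyA
            .subMap(lateEventTime - left, true, lateEventTime + right, true)
            .isEmpty();

        System.out.println(backwardOnly);    // false: duplicate missed
        System.out.println(bothDirections);  // true: duplicate caught
    }
}
```

In other words, timeTo does not look for messages from the future; it catches duplicates whose event-times are newer than the late record currently being processed.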