2
votes

I'm working on a stream processor which has KStream-KStream and KStream-KTable join and also uses a state store remove the duplicates while doing the join.

We have been performing load tests for this processor and the messages in the topic are growing, which is causing the stream processor to take long time (~1 hour) to consume the changelog topics and initialize the state stores when there's a restart/redeployment happens.

We have a retention of 7 days for the topics.

1
This is more a description of your observation than a question? What do you want to know? Are you aware of StandbyTasks? What version do you use? And please, ask a question :) - Matthias J. Sax

1 Answers

0
votes

There are multiple reasons for which this happens:

  1. Your broker performance, i.e. how much data your KStream app can pull from each broker
  2. Your KStream performance
  3. Your serialization format (if you use Avro, the data size will be way smaller)

The solution to avoid expensive restarts is to have a persistent local state store. For example, you can map the default state store folder (/tmp/kafka-streams) to some sort of persistent volume