0 votes

How do stateful operations work in a Kafka Streams application with multiple instances? Let's say we have 2 topics, A and B, with 2 partitions each. We have a streams application that consumes from both topics and joins the 2 streams.

Now we are running 2 instances of this stream application. As far as I understand, each instance will be assigned one of the 2 partitions of each topic.

Now how will the join take place if the messages to be joined are consumed by different instances of the application? I can't get my head around it.

I tested this with a small streams application and it does seem to work correctly. Can I always increase the number of instances of any application without ever thinking about the topology defined in the streams application?

Is there any document that I can go through to understand the details of its working?

Comment: State is maintained by your Kafka cluster -- to do that, it creates any number of internal topics. All the instances of the same application work with the same cluster. – senseiwu

1 Answer

7 votes

The maximum unit of parallelism is the number of partitions. If you run more instances than there are partitions, the excess instances will be idle.
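The point above can be illustrated with a small sketch. This is not Kafka's actual assignor (the real `StreamsPartitionAssignor` also considers stickiness and standby state); it is a simplified round-robin assignment, assuming one task per partition, showing that with 2 partitions and 3 instances one instance gets nothing:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class TaskAssignmentSketch {
    // Simplified round-robin: one task per input partition, spread
    // across instances. Instances beyond the task count stay idle.
    static Map<Integer, List<Integer>> assign(int numTasks, int numInstances) {
        Map<Integer, List<Integer>> byInstance = new LinkedHashMap<>();
        for (int i = 0; i < numInstances; i++) {
            byInstance.put(i, new ArrayList<>());
        }
        for (int task = 0; task < numTasks; task++) {
            byInstance.get(task % numInstances).add(task);
        }
        return byInstance;
    }

    public static void main(String[] args) {
        // 2 partitions -> 2 tasks, but 3 instances: instance 2 is idle.
        System.out.println(assign(2, 3)); // prints {0=[0], 1=[1], 2=[]}
    }
}
```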

A join operation must meet the following requirements:

  1. The input data must be co-partitioned. That means the input topics being joined must have the same number of partitions.

  2. Both topics must use the same partitioning strategy, so that records with the same key are delivered to the same partition number on each topic. If the strategies differ, matching records may land on different partitions and be missed by the join.
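To see why both requirements matter, consider a stand-in for the default partitioner. Kafka's real default applies murmur2 to the serialized key bytes, but the modulo principle is the same: a key maps to the same partition number only if both topics hash it the same way over the same partition count.

```java
public class CoPartitionSketch {
    // Simplified stand-in for Kafka's default partitioner (the real one
    // uses murmur2 over the serialized key bytes, not String.hashCode).
    static int partitionFor(String key, int numPartitions) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    public static void main(String[] args) {
        String key = "user-42"; // hypothetical record key
        // Same strategy, same partition count: both sides of the join
        // land in the same partition, so one task sees both records.
        System.out.println(
            partitionFor(key, 3) == partitionFor(key, 3)); // prints true
        // Different partition counts (2 vs 3): the same key can map to
        // different partition numbers, so the join may miss the pair.
        System.out.println(partitionFor(key, 2) + " vs " + partitionFor(key, 3));
    }
}
```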

Example: if topic1 has 2 partitions and topic2 has 3 partitions, Join(topic1, topic2) will fail because the partition counts are unequal. Once you repartition topic1, say to 3 partitions, Join(topic1, topic2) will work, and you can utilize at most 3 tasks for this operation. Each partition maintains its state in a state store, backed by an internal changelog topic. By default, Kafka Streams uses RocksDB to store the state.
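Assuming Kafka Streams 2.6+ on the classpath (for `KStream#repartition`) and 3.0+ (for `JoinWindows.ofTimeDifferenceWithNoGrace`), the repartition-then-join step might look like the following sketch. The topic names, window size, and joiner are placeholders, and default serdes are assumed; this defines a topology only and needs a running cluster to execute:

```java
import java.time.Duration;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.kstream.JoinWindows;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Repartitioned;

public class JoinTopologySketch {
    // Hypothetical topics: "topic1" has 2 partitions, "topic2" has 3.
    static Topology build() {
        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> left = builder.stream("topic1");
        KStream<String, String> right = builder.stream("topic2");

        // Repartition the 2-partition side to 3 partitions so both
        // join inputs are co-partitioned.
        KStream<String, String> leftBy3 =
            left.repartition(Repartitioned.numberOfPartitions(3));

        // Windowed KStream-KStream join: records with the same key that
        // arrive within the 5-minute window are combined by the joiner.
        leftBy3.join(right,
                     (l, r) -> l + "|" + r,
                     JoinWindows.ofTimeDifferenceWithNoGrace(Duration.ofMinutes(5)))
               .to("joined-output");

        return builder.build();
    }
}
```

Under the hood, the `repartition` step writes through an internal topic with 3 partitions, after which each of the 3 tasks joins only the records from its own partition pair.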

Here you can see how the process usually works for a stateful transformation: (diagram from the original answer, not reproduced here)

Please refer to these for in-depth details:

https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Streams+Internal+Data+Management
https://docs.confluent.io/current/streams/developer-guide/dsl-api.html#streams-developer-guide-dsl-joins