How can Kafka Streams perform a join when the data to join could be allocated on different machines?

Question

I have two Kafka topics with two partitions each. Their messages are keyed by the same parameter, id: Integer.

I have two instances of a Kafka Streams application, so each of them would be assigned two partitions (tasks), one per topic.

Now, imagine that the partition holding messages with id = 1 from topic A is assigned to Kafka Streams app instance A, and the partition holding messages with id = 1 from topic B is assigned to app instance B. How can a join of those two KStreams ever work if the data from the topics may not be co-located (as would happen in this example for id = 1)?

Compare: stackoverflow.com/questions/47104887/… -- it works the same way for aggregation and joins. Data will be repartitioned. — Matthias J. Sax

Aman Juneja · Accepted Answer · 2018-08-19T13:17:04

There are a few ways to do it. If storage is not an issue, or if the message frequency is low, you can use a GlobalKTable for one of the topics. It costs more memory, because all partitions of that topic are synced to every instance of the Streams app.

https://docs.confluent.io/current/streams/concepts.html#globalktable

Another way is to use Kafka Streams interactive queries to discover the data held by other Streams instances.

https://kafka.apache.org/10/documentation/streams/developer-guide/interactive-queries.html

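For KStream joins, both topics need the same number of partitions and the same partitioning strategy. That way, records with the same key land in the same partition number in both topics, and Kafka Streams assigns the same-numbered partitions of both topics to the same task, so every instance reads matching partitions of both topics.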
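To illustrate why the same partition count and the same partitioning strategy matter, here is a minimal plain-Java sketch. It does not use the Kafka API. The `partitionFor` function is a hypothetical stand-in for a partitioner (Kafka's default partitioner hashes the serialized key bytes with murmur2, which is not reproduced here), and `instanceForTask` is an assumed, simplified assignment of tasks to instances.

```java
class CoPartitioningSketch {

    // Hypothetical stand-in partitioner: NOT Kafka's real default, which
    // applies murmur2 to the serialized key bytes. Only the principle matters:
    // the same function is applied to the keys of both topics.
    static int partitionFor(int key, int numPartitions) {
        return Math.floorMod(Integer.hashCode(key), numPartitions);
    }

    // Assumed simplification: the task for partition p (covering partition p
    // of BOTH topics) runs on instance p % numInstances.
    static int instanceForTask(int partition, int numInstances) {
        return partition % numInstances;
    }

    public static void main(String[] args) {
        int partitionsTopicA = 2;
        int partitionsTopicB = 2;
        int instances = 2;

        for (int key = 0; key < 4; key++) {
            int pA = partitionFor(key, partitionsTopicA);
            int pB = partitionFor(key, partitionsTopicB);
            // With equal partition counts and the same partitioner, pA == pB,
            // so both sides of the join for this key belong to one task.
            System.out.printf("key=%d topicA=p%d topicB=p%d instance=%d%n",
                    key, pA, pB, instanceForTask(pA, instances));
        }
    }
}
```

If topic B had three partitions instead of two, the same key could map to different partition numbers in the two topics (in this sketch, key 2 maps to partition 0 of topic A but partition 2 of topic B), and the two sides of the join would no longer be guaranteed to meet in one task.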

A nice reference blog post on partitioning: https://medium.com/@anyili0928/what-i-have-learned-from-kafka-partition-assignment-strategy-799fdf15d3ab
