0
votes

When building a Kafka Streams topology, reads from multiple topics can be modeled in two different ways:

  1. Read all topics with the same source node.

topologyBuilder.addSource("sourceName", ..., "topic1", "topic2", "topic3");

  2. Read each topic using a separate source node.

topologyBuilder.addSource("sourceName1", ..., "topic1")
               .addSource("sourceName2", ..., "topic2")
               .addSource("sourceName3", ..., "topic3");

Does option 1 have any advantage over option 2, or vice versa? All topics contain the same type of data and share the same processing logic.

2 Answers

2
votes

Given that, as you state, all input topics contain the same kind of data and subsequent processing of the data is equivalent, you should most probably go with option 1, for the following two reasons:

1) this will result in a smaller topology

2) you would only need to connect one source node to your subsequent processing steps

If the processing later needs to differ between source topics, you can still split the single source node into several at that point.
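To make the two reasons concrete, here is a minimal sketch of both layouts. It assumes kafka-streams 1.0 or later on the classpath and uses the `Topology` class rather than the older `TopologyBuilder` from the question. The node name "sink", the topic "output-topic" and the helper `countNodes` are hypothetical and only stand in for whatever your real downstream processing is.

```java
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.TopologyDescription;

public class SourceLayouts {

    // Option 1: a single source node subscribed to all three topics.
    public static Topology singleSource() {
        Topology topology = new Topology();
        topology.addSource("source", "topic1", "topic2", "topic3");
        // Hypothetical downstream node; it has exactly one parent to connect.
        topology.addSink("sink", "output-topic", "source");
        return topology;
    }

    // Option 2: one source node per topic, all feeding the same downstream node.
    public static Topology separateSources() {
        Topology topology = new Topology();
        topology.addSource("source1", "topic1");
        topology.addSource("source2", "topic2");
        topology.addSource("source3", "topic3");
        // The downstream node now has to list every source as a parent.
        topology.addSink("sink", "output-topic", "source1", "source2", "source3");
        return topology;
    }

    // Hypothetical helper: counts the nodes across all sub-topologies.
    public static int countNodes(Topology topology) {
        int nodes = 0;
        for (TopologyDescription.Subtopology sub : topology.describe().subtopologies()) {
            nodes += sub.nodes().size();
        }
        return nodes;
    }

    public static void main(String[] args) {
        // describe() prints the node graph without starting a Kafka client.
        System.out.println(singleSource().describe());
        System.out.println(separateSources().describe());
    }
}
```

In this sketch the first layout has two nodes and the second has four, and every additional downstream node in the second layout would again need all three sources named as parents.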

2
votes

There are several other factors to consider.

If your input data is evenly distributed across the input topics (in both message size and message rate), then go for option 1, because of its simplicity. If it is not, the "slow" topics will slow down your overall consumption, so to keep delays small on the "fast" topics go for option 2.

If you run several such topologies in parallel on different nodes (for high availability or high throughput), then having one consumer group (option 1) means more consumers to coordinate within that group. In my experience this also slows down consumption, especially when you restart consumers (or when they drop out of the group). In this case I also go for option 2: fewer consumers in a group take less effort to coordinate, which means shorter delays.