
I have a Kafka consumer that reads messages from a topic and writes them to a time series database. We run multiple instances of the time series database as a cluster across several physical machines.

Our plan is to deploy the consumer on every machine where the time series service is running. So if the time series service runs on 5 nodes, I will install one consumer instance per node. All of those consumer instances belong to the same consumer group. The setup looks like this:

[Image: diagram of the setup]

As you can see, producers P1 and P2 write into two partitions of the Kafka topic, partition 1 and partition 2. I then have 4 instances of the time series service, with one consumer running per instance. How should I read with my consumers so that I do not end up with duplicate messages in my time series database?

Edit: After reading through the Kafka documentation, I came across these two statements:

If all the consumer instances have the same consumer group, then this works just like a traditional queue balancing load over the consumers.

If all the consumer instances have different consumer groups, then this works like publish-subscribe and all messages are broadcast to all consumers.

So in my case above, the topic behaves like a queue. Is my understanding correct?


1 Answer


If all consumers belong to one group (have the same groupId), then the Kafka topic will behave as a queue for you: each message is delivered to only one consumer in the group.
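To make the queue versus publish-subscribe distinction concrete, here is a minimal sketch in plain Python. It is a simplified model, not the Kafka client API: the `deliver` function, the consumer names, and the modulo-based ownership rule are all assumptions made for illustration. The one property it relies on from Kafka is that, within a group, each partition is read by exactly one consumer.

```python
from collections import defaultdict


def deliver(messages, consumer_groups):
    """Simplified model of consumer-group delivery (hypothetical helper).

    messages: list of (partition, payload) tuples
    consumer_groups: dict mapping consumer name -> group id
    Returns a dict mapping consumer name -> list of payloads received.
    """
    # Collect the members of each group in a deterministic order.
    members = defaultdict(list)
    for consumer, group in sorted(consumer_groups.items()):
        members[group].append(consumer)

    received = defaultdict(list)
    for partition, payload in messages:
        # Every group sees every message, but only one member per group,
        # the owner of the partition, receives it.
        for names in members.values():
            owner = names[partition % len(names)]  # illustrative ownership rule
            received[owner].append(payload)
    return dict(received)


messages = [(0, "a"), (1, "b"), (0, "c")]

# Same group: the messages are split between the consumers (queue).
same_group = deliver(messages, {"c1": "g", "c2": "g"})

# Different groups: every consumer gets every message (publish-subscribe).
different_groups = deliver(messages, {"c1": "g1", "c2": "g2"})
```

In the same-group case no payload appears in more than one consumer's list, which is the behaviour the question is after.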

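Important: there is no reason to run more consumers than partitions in a group. Out-of-the-box Kafka consumers scale by partition: each partition is assigned to exactly one consumer in the group, so any extra consumers sit idle. With 2 partitions and 4 consumers, only 2 of the consumers will receive messages.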
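The effect of having more consumers than partitions can be sketched as below. This is an illustrative round-robin assignment written in plain Python, not Kafka's actual assignor (Kafka ships several strategies that differ in detail); `assign_partitions` and the consumer names are hypothetical. What it shares with Kafka is the invariant that each partition goes to exactly one consumer in the group.

```python
def assign_partitions(partitions, consumers):
    """Hand out partitions round-robin; each partition gets exactly one owner.

    Hypothetical helper for illustration, not part of any Kafka library.
    """
    assignment = {consumer: [] for consumer in consumers}
    for index, partition in enumerate(partitions):
        owner = consumers[index % len(consumers)]
        assignment[owner].append(partition)
    return assignment


# The setup from the question: 2 partitions, 4 consumers in one group.
assignment = assign_partitions([1, 2], ["c1", "c2", "c3", "c4"])

# Consumers that ended up with no partition receive no messages.
idle = [consumer for consumer, owned in assignment.items() if not owned]
```

Under this model two of the four consumers own no partition. They can still serve as standbys that take over if an active consumer fails, but they add no read throughput until the topic has more partitions.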

http://kafka.apache.org/images/consumer-groups.png