I am working on a task where I use PySpark/Python to read events from Event Hub. When I have multiple consumer groups, I get duplicate messages. E.g.: I have 2 consumer groups (CG) and 2 events. CG1 is consuming event1, and while that is in progress the 2nd event is triggered, so CG2 consumes it, which is good. But once CG1 is free after consuming event1, it consumes event2 as well, which we want to avoid. Even though the checkpoint is available, it's failing. Is this the default behaviour?
1 Answer
According to your comment, you added multiple consumer groups in order to handle a large number of messages:
Q: Why did you choose to use multiple consumer groups anyway?
A: A good number of messages flow in, so we added two.
Scaling out is done using partitions, not consumer groups. Consumer groups are designed to be independent of each other, and you can't work against that.
Your question:
I have 2 consumer groups (CG) and 2 events. CG1 is consuming event1, and while that is in progress the 2nd event is triggered, so CG2 consumes it, which is good. But once CG1 is free after consuming event1, it consumes event2 as well, which we want to avoid. Even though the checkpoint is available, it's failing. Is this the default behaviour?
The answer is yes, this is the default behaviour. A consumer group is a separate view of the whole message stream. Each consumer group has its own offset (checkpoint) that records how far it has read in that stream. A checkpoint in CG1 therefore says nothing about CG2, and every message will be received by every consumer group.
From the docs:
Consumer groups: A view (state, position, or offset) of an entire event hub. Consumer groups enable consuming applications to each have a separate view of the event stream. They read the stream independently at their own pace and with their own offsets.
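To make the "separate view with its own offset" idea concrete, here is a minimal toy model in plain Python. It is not the Azure SDK; the `ToyEventHub` class and its methods are hypothetical names invented for illustration only. It shows that reading as CG1 advances only CG1's offset, so CG2 still sees every event.

```python
# Toy model only: this is NOT the Azure SDK, just an illustration of offsets.
class ToyEventHub:
    def __init__(self):
        self.events = []   # the single shared, append-only stream
        self.offsets = {}  # consumer group name -> index of next unread event

    def send(self, event):
        self.events.append(event)

    def receive(self, consumer_group):
        """Return the events this group has not read yet and advance its offset."""
        start = self.offsets.get(consumer_group, 0)
        batch = self.events[start:]
        self.offsets[consumer_group] = len(self.events)
        return batch


hub = ToyEventHub()
hub.send("event1")
hub.send("event2")

cg1_seen = hub.receive("CG1")
cg2_seen = hub.receive("CG2")
print(cg1_seen)  # ['event1', 'event2']
print(cg2_seen)  # ['event1', 'event2']
```

Both groups receive both events, which is the "duplicate" behaviour described in the question. Only a second read by the same group comes back empty.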
This picture of the architecture also shows how the messages flow through all consumer groups.
See also this answer that provides more details about consumer groups.
Again, if you want to scale, do not use consumer groups but tweak your provisioned throughput units, partitions or improve the processing logic. See the docs about scalability.
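As a rough sketch of what scaling with partitions looks like, the toy model below spreads events over partitions and lets two workers in the same consumer group each own a disjoint set of them. Again, this is plain Python and not the Azure SDK; the partition count, the round-robin routing and the worker names are assumptions made up for the example.

```python
# Toy model only: NOT the Azure SDK. Partition count and worker names are made up.
NUM_PARTITIONS = 4

events = [f"event{i}" for i in range(1, 9)]

# Round-robin the events across partitions (a stand-in for the service's routing).
partitions = {p: [] for p in range(NUM_PARTITIONS)}
for i, event in enumerate(events):
    partitions[i % NUM_PARTITIONS].append(event)

# Two workers in the SAME consumer group, each owning a disjoint set of partitions.
ownership = {"worker-a": [0, 1], "worker-b": [2, 3]}

# Each worker reads only the partitions it owns.
processed = {
    worker: [e for p in owned for e in partitions[p]]
    for worker, owned in ownership.items()
}

for worker, seen in processed.items():
    print(worker, seen)
```

Because the workers share one consumer group and split the partitions between them, every event is handled by exactly one worker, which is the behaviour the question was trying to get from two consumer groups.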