
I am trying to have multiple consumers for multiple partitions of Kafka topic with same groupId which will help me scale the consumption of messages.

The Kafka documentation says:

If all the consumer instances have the same consumer group, then the records will effectively be load-balanced over the consumer instances.

Having consumers as part of the same consumer group means applying the "competing consumers" pattern, in which the messages from topic partitions are spread across the members of the group. Each consumer receives messages from one or more partitions ("automatically" assigned to it), and the same messages won't be received by the other consumers (assigned to different partitions). In this way, we can scale the number of consumers up to the number of partitions (with one consumer reading exactly one partition).
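For reference, this is how the "competing consumers" pattern looks with the plain Kafka consumer API. It is only a minimal Scala sketch: the broker address localhost:9092 and the group id my-consumer-group are assumptions, and the topic name cpq.cluster is taken from the exception below. Running several instances of this process makes the group coordinator split the topic's partitions among them.

    import java.time.Duration
    import java.util.{Collections, Properties}
    import org.apache.kafka.clients.consumer.KafkaConsumer
    import org.apache.kafka.common.serialization.StringDeserializer

    object CompetingConsumer {
      def main(args: Array[String]): Unit = {
        val props = new Properties()
        props.put("bootstrap.servers", "localhost:9092") // assumed broker address
        props.put("group.id", "my-consumer-group")       // same group.id in every instance
        props.put("key.deserializer", classOf[StringDeserializer].getName)
        props.put("value.deserializer", classOf[StringDeserializer].getName)

        val consumer = new KafkaConsumer[String, String](props)
        consumer.subscribe(Collections.singletonList("cpq.cluster")) // topic from the exception

        // The group coordinator assigns each running instance a disjoint
        // subset of the topic's partitions; together they cover the topic.
        while (true) {
          val records = consumer.poll(Duration.ofMillis(500))
          records.forEach(r =>
            println(s"partition=${r.partition()} offset=${r.offset()} value=${r.value()}"))
        }
      }
    }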

But when I deploy multiple Spark applications with the same groupId, I get the following exception:

java.lang.IllegalStateException: Previously tracked partitions [cpq.cluster-1] been revoked by Kafka because of consumer rebalance. This is mostly due to another stream with same group id joined, please check if there're different streaming application misconfigure to use same group id. Fundamentally different stream should use different group id
at org.apache.spark.streaming.kafka010.DirectKafkaInputDStream.latestOffsets(DirectKafkaInputDStream.scala:200)
at org.apache.spark.streaming.kafka010.DirectKafkaInputDStream.compute(DirectKafkaInputDStream.scala:228)

According to this exception, I cannot have multiple consumers with the same groupId. Hence I am unable to load-balance consumption across my Spark applications; I can only run a single consumer application per topic, and this contradicts the Kafka documentation.

How can I have multiple consumers with the same consumer groupId to achieve load balancing?

A bit thin on detail here, but I suspect the answer covers the situation at hand. – thebluephantom
@thebluephantom Updated the Kafka documentation section. I want to have competing consumers with Spark. – Anand Jain
Still think the answer is correct. – thebluephantom

1 Answer


You don't need to run multiple Spark applications to consume from multiple partitions; a single Spark application handles this internally. Spark Streaming uses a 1:1 mapping between Kafka partitions and Spark RDD partitions, so one application already consumes all partitions in parallel. If you run several Spark applications with the same group id, you get the error above. Please refer to this question for more details: 2 spark stream job with same consumer group id
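As a sketch of what that looks like with spark-streaming-kafka-0-10 (the broker address localhost:9092, the 5-second batch interval, and the group id cpq-group are assumptions; the topic cpq.cluster comes from the exception in the question):

    import org.apache.kafka.clients.consumer.ConsumerRecord
    import org.apache.kafka.common.serialization.StringDeserializer
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka010.KafkaUtils
    import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
    import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

    object SingleAppAllPartitions {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("cpq-consumer") // hypothetical app name
        val ssc  = new StreamingContext(conf, Seconds(5))     // assumed batch interval

        val kafkaParams = Map[String, Object](
          "bootstrap.servers"  -> "localhost:9092",           // assumed broker address
          "key.deserializer"   -> classOf[StringDeserializer],
          "value.deserializer" -> classOf[StringDeserializer],
          "group.id"           -> "cpq-group",                // one group id, one application
          "auto.offset.reset"  -> "latest"
        )

        // A single direct stream subscribes to the whole topic; Spark creates
        // one RDD partition per Kafka partition, so consumption is already
        // parallel across the executors of this one application.
        val stream = KafkaUtils.createDirectStream[String, String](
          ssc,
          PreferConsistent,
          Subscribe[String, String](Seq("cpq.cluster"), kafkaParams)
        )

        stream.foreachRDD { rdd =>
          // Each task processes the records of exactly one Kafka partition.
          rdd.foreach((r: ConsumerRecord[String, String]) => println(r.value()))
        }

        ssc.start()
        ssc.awaitTermination()
      }
    }

If you genuinely need a second, independent stream over the same topic, give it its own group.id, as the exception message itself suggests.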