3 votes

I have a Kafka topic with 3 partitions and I'm consuming that data using Spark Structured Streaming. I have 3 consumers (let's say consumer group A), each reading from a single partition, and everything is working fine up to this point.

I have a new requirement to read from the same topic, and I want to parallelize it by again creating 3 consumers (say consumer group B), each reading from a single partition. As I'm using Structured Streaming, I can't set group.id explicitly.

Will consumers from different groups pointing to the same partition each read all the data?

3
I don't know how Spark handles this, but if the question is whether reads are independent between groups: yes. You'll have two consumers for each partition, each with its own group id, reading all the messages independently. – aran

3 Answers

2 votes

From the Spark 3.0.1 documentation:

By default, each query generates a unique group id for reading data. This ensures that each Kafka source has its own consumer group that does not face interference from any other consumer, and therefore can read all of the partitions of its subscribed topics.

So, if you are using the assign option and specifying which partition to read, it will read all data from that partition, because by default each query runs under its own consumer group (group.id). The assign option takes a JSON string as its value and can cover multiple partitions from different topics as well, e.g. {"topicA":[0,1],"topicB":[2,4]}.

val df = spark
  .read // batch read; use readStream for a streaming query
  .format("kafka")
  .option("kafka.bootstrap.servers", "host:port")
  .option("assign", """{"topic-name":[0]}""")
  .load()
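
To match the setup in the question (consumer group B, one query per partition), you could start three streaming queries, each assigned a single partition. This is only a minimal sketch: the topic name "topic-name", the broker address host:port, and the console sink are placeholder assumptions. Each query still gets its own auto-generated group id, so together they re-read the whole topic independently of consumer group A.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("group-b-readers").getOrCreate()

// One streaming query per partition; Spark generates a distinct
// consumer group id for each query automatically.
val queries = (0 to 2).map { partition =>
  spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "host:port")
    .option("assign", s"""{"topic-name":[$partition]}""")
    .load()
    .writeStream
    .format("console") // placeholder sink for illustration
    .start()
}

spark.streams.awaitAnyTermination()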
0 votes

You can set the consumer group id for streaming as below; note that the option needs the kafka. prefix and is only supported from Spark 3.0:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

String processingGroup = "processingGroupA";

Dataset<Row> raw_df = sparkSession
                      .readStream()
                      .format("kafka")
                      .option("kafka.bootstrap.servers", consumerAppProperties.getProperty(BOOTSTRAP_SERVERS_CONFIG))
                      .option("subscribe", topicName)
                      .option("startingOffsets", "latest")
                      // Kafka consumer properties need the "kafka." prefix;
                      // a plain "group.id" option is not picked up by the source.
                      .option("kafka.group.id", processingGroup)
                      .load();
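
Note that if two concurrently running queries were given the same group id, Kafka would split the topic's partitions between them and each would see only part of the data, which is exactly why Spark generates a unique group id per query by default (see the source comment quoted in the next answer). So give every independent job its own group id.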
0 votes

Unless you are using Spark 3.x or higher, you will not be able to set group.id in your Kafka input stream. With Spark 3.x you can, as you have mentioned, run two different Structured Streaming jobs, each providing its own group.id, to ensure that each job reads all messages of the topic independently of the other job.
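
As a minimal Scala sketch of that Spark 3.x approach (the topic name, broker address, and the two group ids here are assumptions, and each job would normally be a separate application):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("two-groups").getOrCreate()

// Helper that subscribes to the full topic under an explicit group id.
// kafka.group.id is only honored on Spark 3.0 and above.
def readAll(groupId: String) =
  spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "host:port")
    .option("subscribe", "topic-name")
    .option("kafka.group.id", groupId)
    .load()

// Two distinct group ids, so each query reads every message of the topic.
val jobA = readAll("consumer-group-A").writeStream.format("console").start()
val jobB = readAll("consumer-group-B").writeStream.format("console").start()

spark.streams.awaitAnyTermination()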

For Spark versions <= 2.4.x, Spark itself creates a unique consumer group for you, as you can see in the code on GitHub:

// Each running query should use its own group id. Otherwise, the query may be only 
// assigned partial data since Kafka will assign partitions to multiple consumers having
// the same group id. Hence, we should generate a unique id for each query.
val uniqueGroupId = s"spark-kafka-source-${UUID.randomUUID}-${metadataPath.hashCode}"

So, in that case too, running two different streaming jobs ensures that you have two different consumer groups, which allows both jobs to read all messages from the topic independently of each other.
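
For Spark <= 2.4.x the same effect therefore needs no configuration at all; a rough sketch (topic and broker are placeholders) is just two subscriptions to the same topic:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("independent-streams").getOrCreate()

def fullTopicStream() =
  spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "host:port")
    .option("subscribe", "topic-name")
    .load()

// No group id is set anywhere: each start() creates a query whose source
// generates its own spark-kafka-source-<uuid>-<hash> group, so both
// streams receive every message without stealing partitions from each other.
val q1 = fullTopicStream().writeStream.format("console").start()
val q2 = fullTopicStream().writeStream.format("console").start()

spark.streams.awaitAnyTermination()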