
I have a Spark cluster with 17 executors in total. I have integrated Spark 2.1 with Kafka and am reading data from a topic like this:

val df = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "test")
  .load()

Now I want to know: when I submit my Spark application in cluster mode, how many executors (out of the total 17) will be assigned to listen to the Kafka topic and create micro-batches in Structured Streaming?

Also, how can I limit the size of a micro-batch in Structured Streaming when reading from Kafka?


1 Answer


Structured Streaming creates one Spark partition per Kafka topic partition. Since a single partition is processed by a single task (one core at a time), reading the topic will use at most as many cores as the topic has partitions, taken from the executors assigned to the application.
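As an illustration (a minimal sketch, continuing from the df in the question; the partition count of 34 is just a hypothetical value for 17 executors with 2 cores each), downstream parallelism can be raised beyond the number of topic partitions by repartitioning after the Kafka source:

// The Kafka source yields one Spark partition per topic partition, so
// work after the source can be spread over more cores by repartitioning.
val parsed = df
  .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .repartition(34)  // hypothetical: 17 executors * 2 cores each

This only affects the stages after the source; the read itself is still bounded by the number of Kafka partitions.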

The number of messages processed in a batch depends primarily on the trigger used (and, as a consequence, on the trigger interval, if micro-batching is used at all). However, take a look at the maxOffsetsPerTrigger option:

Rate limit on maximum number of offsets processed per trigger interval. The specified total number of offsets will be proportionally split across topicPartitions of different volume.
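For example (a minimal sketch, assuming Spark 2.1, the broker and topic from the question, and a console sink used purely for illustration), maxOffsetsPerTrigger can be combined with a processing-time trigger to cap each micro-batch:

import org.apache.spark.sql.streaming.ProcessingTime

// Cap each micro-batch at 10,000 offsets (split proportionally across the
// topic's partitions) and fire a new batch every 10 seconds.
val limited = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "test")
  .option("maxOffsetsPerTrigger", "10000")
  .load()

val query = limited
  .writeStream
  .format("console")
  .outputMode("append")
  .trigger(ProcessingTime("10 seconds"))
  .start()

With this setting, each trigger processes at most 10,000 offsets in total, regardless of how much data has accumulated in the topic.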