I am working on a Spark Structured Streaming application that reads from a Confluent open-source Kafka cluster and runs as a Spark job on AWS EMR. We have 20+ Kafka topics, each receiving data in Avro format and partitioned into 3 to 4 partitions. I am reading all 20+ topics (as a comma-separated list of topic names) using a single Spark readStream. I then filter each message row from the resulting DataFrame, apply the correct Avro schema to each message, and write the resulting Dataset[T] to S3 and Cassandra.
A few questions that I couldn't find answers for:
1. Am I OK to use one readStream for all the topics? Will it be considered one Spark consumer for all the topics and partitions, given that I am only executing one spark-submit job in EMR?
2. How does the Spark application distribute processing across the partitions? Does Spark read these topics/partitions in parallel using different executors, or do I need to implement any multi-threading for each partition? (The source options I found that look relevant are sketched right after this list.)
3. Is it possible to scale out to multiple consumers within a consumer group to parallelise?
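For context, these are the Kafka source options from the Structured Streaming + Kafka integration guide that look related to read parallelism and batch size. This is only a sketch; the broker address, topics and values are placeholders, not settings I have verified:

// Sketch only: minPartitions / maxOffsetsPerTrigger are the options I found
// that appear to influence read parallelism and per-batch volume.
val dfWithHints = sparkSession.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092")   // placeholder broker
  .option("subscribe", "topic1,topic2")                 // placeholder topics
  .option("minPartitions", "80")             // ask for at least 80 Spark input partitions
  .option("maxOffsetsPerTrigger", "100000")  // cap records read per micro-batch
  .load()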
Apologies for the load of questions; I think they are all related. I would appreciate any feedback or pointers to where I can find documentation.
My config:
val kafkaParams = Map(
  "kafka.bootstrap.servers" -> "topic1,topic2,topic3,topic4,topic5",
  "failOnDataLoss" -> param.fail_on_data_loss.toString,
  "subscribe" -> param.topics.toString,
  "startingOffsets" -> param.starting_offsets.toString,
  "kafka.security.protocol" -> param.kafka_security_protocol.toString,
  "kafka.ssl.truststore.location" -> param.kafka_ssl_truststore_location.toString,
  "kafka.ssl.truststore.password" -> param.kafka_ssl_truststore_password.toString
)

ReadStream code:
val df = sparkSession.readStream
  .format("kafka")
  .options(kafkaParams)
  .load()

Then I split the input DataFrame into multiple DataFrames using the 'topic' column and apply the Avro schema to each resulting DataFrame.
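To make the split and Avro decoding concrete, this is roughly what I do per topic. It is only a sketch: the topic name, the inline schema string and the case class are placeholders for my real schemas, and I am showing spark-avro's from_avro purely for illustration (records written with the Confluent Avro serializer carry a 5-byte schema-id prefix that plain from_avro does not understand, so the payload may need that prefix stripped first):

// Spark 3.x: from_avro lives in org.apache.spark.sql.avro.functions
// (in Spark 2.4 it is org.apache.spark.sql.avro.from_avro).
import org.apache.spark.sql.Dataset
import org.apache.spark.sql.avro.functions.from_avro
import sparkSession.implicits._

// Placeholder schema and case class for one of the topics.
val topic1Schema = """{"type":"record","name":"Topic1Record","fields":[{"name":"id","type":"string"},{"name":"value","type":"double"}]}"""
case class Topic1Record(id: String, value: Double)

// Keep only the rows for this topic, then decode the Avro payload.
val topic1Ds: Dataset[Topic1Record] = df
  .filter($"topic" === "topic1")                           // placeholder topic name
  .select(from_avro($"value", topic1Schema).as("record"))  // 'value' holds the Avro-encoded bytes
  .select("record.*")
  .as[Topic1Record]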
I then write each Dataset[T] to different sinks such as S3, Cassandra, etc.
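The write side looks roughly like this, sketched with foreachBatch so the same micro-batch can go to both sinks. The checkpoint path, bucket, keyspace and table names are placeholders, it reuses the placeholder Topic1Record dataset from the sketch above, and the Cassandra write assumes the spark-cassandra-connector is on the classpath:

import org.apache.spark.sql.Dataset

// An explicitly typed function value avoids the foreachBatch overload
// ambiguity seen on Spark 2.4 with Scala 2.12.
val writeBatch: (Dataset[Topic1Record], Long) => Unit = (batch, batchId) => {
  // Persist the micro-batch to S3 as Parquet (placeholder bucket/prefix).
  batch.write
    .mode("append")
    .parquet("s3://my-bucket/decoded/topic1/")

  // Persist the same micro-batch to Cassandra (placeholder keyspace/table).
  batch.write
    .format("org.apache.spark.sql.cassandra")
    .options(Map("keyspace" -> "my_keyspace", "table" -> "topic1_table"))
    .mode("append")
    .save()
}

val query = topic1Ds.writeStream
  .option("checkpointLocation", "s3://my-bucket/checkpoints/topic1")  // placeholder path
  .foreachBatch(writeBatch)
  .start()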