
I have a Spark application that needs to read from two different topics with one consumer, using Spark Java. The Kafka message key and value schemas are the same for both topics.

Below is the workflow:

1. Read messages from both topics, with the same group id, using JavaInputDStream<ConsumerRecord<String, String>> and iterate with foreachRDD.
2. Inside the loop, read the offsets, filter messages based on the message key, and create a JavaRDD<String>.
3. Iterate over the JavaRDD<String> using mapPartitions.
4. Inside the mapPartitions loop, iterate over the records using forEachRemaining.
5. Perform data enrichment, transformation, etc. on the rows inside the forEachRemaining loop.
6. Commit the offsets (a simplified sketch of this workflow follows below).
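
A simplified sketch of the workflow (the topic names, key filter, and enrichment step are placeholders; jssc is the JavaStreamingContext and kafkaParams the consumer config, both set up elsewhere):

import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.List;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.streaming.api.java.JavaInputDStream;
import org.apache.spark.streaming.kafka010.*;

// 1. One direct stream subscribed to both topics
Collection<String> topics = Arrays.asList("topic1", "topic2");
JavaInputDStream<ConsumerRecord<String, String>> stream =
  KafkaUtils.createDirectStream(
    jssc,
    LocationStrategies.PreferConsistent(),
    ConsumerStrategies.<String, String>Subscribe(topics, kafkaParams));

stream.foreachRDD(rdd -> {
  // 2. Read the offsets of this batch, then filter on the message key
  OffsetRange[] offsetRanges = ((HasOffsetRanges) rdd.rdd()).offsetRanges();
  JavaRDD<String> rows = rdd
    .filter(record -> "someKey".equals(record.key()))       // placeholder predicate
    .map(ConsumerRecord::value);

  // 3-5. Enrich/transform the rows per partition
  rows.mapPartitions(it -> {
    List<String> out = new ArrayList<>();
    it.forEachRemaining(row -> out.add(row.toUpperCase())); // enrichment placeholder
    return out.iterator();
  }).count();                                               // action to force evaluation

  // 6. Commit the consumed offsets once the batch is done
  ((CanCommitOffsets) stream.inputDStream()).commitAsync(offsetRanges);
});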

I want to understand the questions below. Please share your answers or any documentation that can help me find them.

1. How are messages received/consumed from two topics (one common group id, the same key/value schema for both) in one consumer?
Say the consumer reads data every second, Producer1 produces 50 messages to Topic1, and Producer2 produces 1000 messages to Topic2.
2. Is it going to read all messages (1000 + 50) in one batch and process them together in the workflow, or is it going to read the 50 messages first, process them, and then read and process the 1000?
3. What parameter should I use to control the number of messages read in one batch per second?
4. Will the same group id create any issue while consuming?

2 Answers

0 votes

The official Spark Streaming documentation already explains how to consume multiple topics with one group id: https://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html

import java.util.Arrays;
import java.util.Collection;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.spark.streaming.api.java.JavaInputDStream;
import org.apache.spark.streaming.kafka010.*;

// kafkaParams is the usual Map<String, Object> of Kafka consumer properties
Collection<String> topics = Arrays.asList("topicA", "topicB");

JavaInputDStream<ConsumerRecord<String, String>> stream =
  KafkaUtils.createDirectStream(
    streamingContext,
    LocationStrategies.PreferConsistent(),
    ConsumerStrategies.<String, String>Subscribe(topics, kafkaParams)
  );
1. One group id, and the same schema is followed for both topics.
2. Not sure about this, but from my understanding it would consume all the messages, depending on the batch size.
3. Set "spark.streaming.backpressure.enabled" to true and "spark.streaming.kafka.maxRatePerPartition" to a numeric value; based on these, Spark limits the number of messages to consume from Kafka per batch. Also set the batch duration accordingly (see the sketch after this list). https://spark.apache.org/docs/latest/api/java/index.html?org/apache/spark/streaming/api/java/JavaStreamingContext.html
4. This totally depends on your application usage.
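
A minimal sketch of how those settings could be wired up (the app name and the rate value are illustrative, not from the answer):

import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

SparkConf conf = new SparkConf()
  .setAppName("two-topic-consumer")                          // placeholder name
  .set("spark.streaming.backpressure.enabled", "true")       // adapt rate to processing speed
  .set("spark.streaming.kafka.maxRatePerPartition", "100");  // max records/partition/second

// With a 1-second batch duration, each batch then holds at most
// maxRatePerPartition * totalPartitions * 1s records.
JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(1));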
0 votes

1. How are messages received/consumed from two topics (one common group id, the same key/value schema for both) in one consumer? Say the consumer reads data every second, Producer1 produces 50 messages to Topic1, and Producer2 produces 1000 messages to Topic2.
Any Kafka consumer can subscribe to a list of topics, so there is no constraint here.
So if you have one consumer, it will be responsible for all the partitions of both Topic1 and Topic2.
2. Is it going to read all messages (1000 + 50) in one batch and process them together in the workflow, or is it going to read the 50 messages first, process them, and then read and process the 1000?
3. What parameter should I use to control the number of messages read in one batch per second?
Answer to both questions 2 and 3:
It will receive all the messages together (1050), or even more, depending on your configuration.
To allow the consumer to receive batches of 1050 or greater, raise max.poll.records (default 500) to 1050 or more, as sketched below; another setting may still become a bottleneck, but the defaults should be fine for the rest.
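
For example, a kafkaParams map with that override (the broker address and group name are placeholders):

import java.util.HashMap;
import java.util.Map;
import org.apache.kafka.common.serialization.StringDeserializer;

Map<String, Object> kafkaParams = new HashMap<>();
kafkaParams.put("bootstrap.servers", "broker1:9092");    // placeholder address
kafkaParams.put("key.deserializer", StringDeserializer.class);
kafkaParams.put("value.deserializer", StringDeserializer.class);
kafkaParams.put("group.id", "my-consumer-group");        // keep it stable across restarts (see 4.)
kafkaParams.put("enable.auto.commit", false);            // commit manually after processing
kafkaParams.put("max.poll.records", 1050);               // default is 500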
4. Will the same group id create any issue while consuming?
The same group id will only affect you if you create more than one consumer; the consumers will then split the partitions of both topics among themselves.
Moreover, if your consumer dies or stops for some reason, you should bring it back up with the same group id; that way the consumer "remembers" the last offset committed and resumes from the point where it stopped.

If you have any more problems regarding your consumer, I suggest reading chapter 4 of Kafka: The Definitive Guide, which explains consumers in depth and should answer further questions.
If you want to explore the configuration options, the documentation is always helpful.