Spark streaming applications subscribing to same kafka topic

Question

I am new to spark and kafka and I have a slightly different usage pattern of spark streaming with kafka. I am using

spark-core_2.10 - 2.1.1
spark-streaming_2.10 - 2.1.1
spark-streaming-kafka-0-10_2.10 - 2.0.0
kafka_2.10 - 0.10.1.1

Continuous event data is being streamed to a kafka topic which I need to process from multiple spark streaming applications. But when I run the spark streaming apps, only one of them receives the data.

     Map<String, Object> kafkaParams = new HashMap<String, Object>();

     kafkaParams.put("bootstrap.servers", "localhost:9092");
     kafkaParams.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
     kafkaParams.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer"); 
     kafkaParams.put("auto.offset.reset", "latest");
     kafkaParams.put("group.id", "test-consumer-group");
     kafkaParams.put("enable.auto.commit", "true");
     kafkaParams.put("auto.commit.interval.ms", "1000");
     kafkaParams.put("session.timeout.ms", "30000");

     Collection<String> topics =  Arrays.asList("4908100105999_000005");;
     JavaInputDStream<ConsumerRecord<String, String>> stream =  org.apache.spark.streaming.kafka010.KafkaUtils.createDirectStream(
                    ssc,
                    LocationStrategies.PreferConsistent(),
                    ConsumerStrategies.<String, String> Subscribe(topics, kafkaParams) );

      ... //spark processing

I have two spark streaming applications, usually the first one I submit consumes the kafka messages. Second application just waits for messages and never proceeds. As I read, kafka topics can be subscribed from multiple consumers, is it not true for spark streaming ? Or there is something I am missing with kafka topic and its configuration ?

Thanks in advance .

Sachin Thapa Sachin Thapa · Accepted Answer · 2017-08-29T18:14:00

You can create different streams with same groupids. Here are more details from the online documentation for 0.8 integrations, there are two approaches:

Approach 1: Receiver-based Approach

Multiple Kafka input DStreams can be created with different groups and topics for parallel receiving of data using multiple receivers.

Approach 2: Direct Approach (No Receivers)

No need to create multiple input Kafka streams and union them. With directStream, Spark Streaming will create as many RDD partitions as there are Kafka partitions to consume, which will all read data from Kafka in parallel. So there is a one-to-one mapping between Kafka and RDD partitions, which is easier to understand and tune.

You can read more at Spark Streaming + Kafka Integration Guide 0.8

From your code looks like you are using 0.10, refer Spark Streaming + Kafka Integration Guide (Kafka broker version 0.10.0

Even thought it is using spark streaming api, everything is controlled by kafka properties so depends on group id you specify in properties file, you can start multiple streams with different group id's.

Cheers !

Spark streaming applications subscribing to same kafka topic

2 Answers