2 votes

If I understand correctly, by default Spark Streaming 1.6.1 uses a single thread to read data from each Kafka partition. Suppose my Kafka topic has 50 partitions: does that mean messages within each of the 50 partitions will be read sequentially, or perhaps in round-robin fashion?

Case 1:

- If so, how do I parallelize the read operation at the partition level? Is creating multiple KafkaUtils.createDirectStream instances the only solution?

e.g.

      import kafka.serializer.StringDecoder
      import org.apache.spark.streaming.kafka.KafkaUtils

      // Each direct stream reads the full topicsSet; .map(_._2) keeps only the message value.
      val stream1 = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
        ssc, kafkaParams, topicsSet).map(_._2)

      val stream2 = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
        ssc, kafkaParams, topicsSet).map(_._2)
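
For reference, if multiple direct streams were created, they would typically cover disjoint topic sets and then be combined with ssc.union; two direct streams over the same topicsSet would each read every message, duplicating the data rather than sharing the load. A minimal sketch continuing the snippet above, where topicsSetA and topicsSetB are hypothetical disjoint topic sets:

      // Hypothetical disjoint topic sets; two direct streams over the SAME
      // topicsSet would deliver every message twice rather than split the work.
      val streamA = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
        ssc, kafkaParams, topicsSetA).map(_._2)
      val streamB = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
        ssc, kafkaParams, topicsSetB).map(_._2)

      // Combine both streams for downstream processing.
      val unified = ssc.union(Seq(streamA, streamB))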

Case 2:

- If my Kafka partitions are each receiving 5 messages/sec, how do the "--conf spark.streaming.kafka.maxRatePerPartition=3" and "--conf spark.streaming.blockInterval" properties come into the picture in such a scenario?


2 Answers

1 vote

In the direct model:

  • each partition is accessed sequentially
  • different partitions are accessed in parallel (see the sketch below)
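
A minimal sketch of this behaviour, assuming Spark 1.6 with the Kafka 0.8 direct integration; the broker address and topic name are placeholder values:

      import kafka.serializer.StringDecoder
      import org.apache.spark.SparkConf
      import org.apache.spark.streaming.{Seconds, StreamingContext}
      import org.apache.spark.streaming.kafka.KafkaUtils

      val conf = new SparkConf().setAppName("direct-stream-parallelism")
      val ssc = new StreamingContext(conf, Seconds(10))

      val kafkaParams = Map("metadata.broker.list" -> "broker1:9092") // placeholder broker
      val topicsSet = Set("myTopic") // placeholder 50-partition topic

      // One direct stream is enough: each Kafka partition maps 1:1 to an RDD
      // partition, so a 50-partition topic yields 50 RDD partitions that are
      // processed in parallel across the available executor cores.
      val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
        ssc, kafkaParams, topicsSet)

      stream.foreachRDD { rdd =>
        println(s"RDD partitions this batch: ${rdd.partitions.length}") // == Kafka partition count
      }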

In the second case it depends on the batch interval, but in general, if maxRatePerPartition is lower than the actual arrival rate per second, you'll always be lagging behind: at 5 messages/sec against a cap of 3, the backlog grows by 2 messages per partition every second.

1
votes

In case two:

spark.streaming.blockInterval

only affects receiver-based streams; from the docs:

Interval at which data received by Spark Streaming receivers is chunked into blocks of data before storing them in Spark.
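
So it has no effect on the direct streams in your question. A minimal sketch of where it does apply, assuming a hypothetical receiver-based stream with placeholder ZooKeeper, group, and topic values:

      import org.apache.spark.SparkConf
      import org.apache.spark.streaming.{Seconds, StreamingContext}
      import org.apache.spark.streaming.kafka.KafkaUtils

      // Received data is cut into blocks every blockInterval; each block
      // becomes one partition (one task) of the batch's RDD.
      val conf = new SparkConf()
        .setAppName("receiver-block-interval")
        .set("spark.streaming.blockInterval", "200ms") // 200ms is the default

      val ssc = new StreamingContext(conf, Seconds(10))

      // Receiver-based (not direct) Kafka stream; zkQuorum, group id and
      // topic map are placeholder values.
      val receiverStream = KafkaUtils.createStream(
        ssc, "zk1:2181", "my-consumer-group", Map("myTopic" -> 1))

      // With a 10s batch and 200ms blocks, each batch holds up to
      // 10s / 200ms = 50 blocks, i.e. up to 50 tasks per receiver.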


spark.streaming.kafka.maxRatePerPartition = 3, which is below the 5 messages/sec you mention, means the total delay will keep increasing; see:

http://spark.apache.org/docs/latest/streaming-programming-guide.html#setting-the-right-batch-interval
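
To make the arithmetic concrete, a sketch assuming a hypothetical 10-second batch interval:

      import org.apache.spark.SparkConf

      // Cap the direct stream at 3 records/sec per Kafka partition.
      val conf = new SparkConf()
        .set("spark.streaming.kafka.maxRatePerPartition", "3")

      // With a 10s batch interval, each batch reads at most
      // 3 records/sec * 10s = 30 records per partition, while ~50 arrive,
      // so roughly 20 records per partition are left behind every batch
      // and the scheduling delay grows without bound.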