2 votes

If I understand correctly, by default Spark Streaming 1.6.1 uses a single thread to read data from each Kafka partition. Suppose my Kafka topic has 50 partitions: does that mean messages within each of the 50 partitions will be read sequentially, or perhaps in round-robin fashion?

Case 1:

- If so, how do I parallelize the read operation at the partition level? Is creating multiple KafkaUtils.createDirectStream instances the only solution?

e.g.

      import kafka.serializer.StringDecoder
      import org.apache.spark.streaming.kafka.KafkaUtils

      // Each direct stream reads the full topicsSet; .map(_._2) keeps only the message value.
      val stream1 = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
        ssc, kafkaParams, topicsSet).map(_._2)

      val stream2 = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
        ssc, kafkaParams, topicsSet).map(_._2)
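
For reference, if multiple direct streams were created, they would typically cover disjoint topic sets and then be combined with ssc.union; two direct streams over the same topicsSet would each read every message, duplicating the data rather than sharing the load. A minimal sketch continuing the snippet above, where topicsSetA and topicsSetB are hypothetical disjoint topic sets:

      // Hypothetical disjoint topic sets; two direct streams over the SAME
      // topicsSet would deliver every message twice rather than split the work.
      val streamA = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
        ssc, kafkaParams, topicsSetA).map(_._2)
      val streamB = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
        ssc, kafkaParams, topicsSetB).map(_._2)

      // Combine both streams for downstream processing.
      val unified = ssc.union(Seq(streamA, streamB))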

Case 2:

- If my Kafka partitions are each receiving 5 messages/sec, how do the "--conf spark.streaming.kafka.maxRatePerPartition=3" and "--conf spark.streaming.blockInterval" properties come into the picture in such a scenario?


2 Answers

1 vote

In the direct model:

  • each partition is accessed sequentially
  • different partitions are accessed in parallel (see the sketch below)
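
A minimal sketch of this behaviour, assuming Spark 1.6 with the Kafka 0.8 direct integration; the broker address and topic name are placeholder values:

      import kafka.serializer.StringDecoder
      import org.apache.spark.SparkConf
      import org.apache.spark.streaming.{Seconds, StreamingContext}
      import org.apache.spark.streaming.kafka.KafkaUtils

      val conf = new SparkConf().setAppName("direct-stream-parallelism")
      val ssc = new StreamingContext(conf, Seconds(10))

      val kafkaParams = Map("metadata.broker.list" -> "broker1:9092") // placeholder broker
      val topicsSet = Set("myTopic") // placeholder 50-partition topic

      // One direct stream is enough: each Kafka partition maps 1:1 to an RDD
      // partition, so a 50-partition topic yields 50 RDD partitions that are
      // processed in parallel across the available executor cores.
      val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
        ssc, kafkaParams, topicsSet)

      stream.foreachRDD { rdd =>
        println(s"RDD partitions this batch: ${rdd.partitions.length}") // == Kafka partition count
      }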

In the second case it depends on the batch interval, but in general, if maxRatePerPartition is lower than the actual arrival rate per second, you'll always be lagging behind: at 5 messages/sec against a cap of 3, the backlog grows by 2 messages per partition every second.

1
votes

In case two:

spark.streaming.blockInterval

only affects receiver-based streams; from the docs:

Interval at which data received by Spark Streaming receivers is chunked into blocks of data before storing them in Spark.
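
So it has no effect on the direct streams in your question. A minimal sketch of where it does apply, assuming a hypothetical receiver-based stream with placeholder ZooKeeper, group, and topic values:

      import org.apache.spark.SparkConf
      import org.apache.spark.streaming.{Seconds, StreamingContext}
      import org.apache.spark.streaming.kafka.KafkaUtils

      // Received data is cut into blocks every blockInterval; each block
      // becomes one partition (one task) of the batch's RDD.
      val conf = new SparkConf()
        .setAppName("receiver-block-interval")
        .set("spark.streaming.blockInterval", "200ms") // 200ms is the default

      val ssc = new StreamingContext(conf, Seconds(10))

      // Receiver-based (not direct) Kafka stream; zkQuorum, group id and
      // topic map are placeholder values.
      val receiverStream = KafkaUtils.createStream(
        ssc, "zk1:2181", "my-consumer-group", Map("myTopic" -> 1))

      // With a 10s batch and 200ms blocks, each batch holds up to
      // 10s / 200ms = 50 blocks, i.e. up to 50 tasks per receiver.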


spark.streaming.kafka.maxRatePerPartition = 3, which is below the 5 messages/sec you mention, means the total delay will keep increasing; see:

http://spark.apache.org/docs/latest/streaming-programming-guide.html#setting-the-right-batch-interval
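
To make the arithmetic concrete, a sketch assuming a hypothetical 10-second batch interval:

      import org.apache.spark.SparkConf

      // Cap the direct stream at 3 records/sec per Kafka partition.
      val conf = new SparkConf()
        .set("spark.streaming.kafka.maxRatePerPartition", "3")

      // With a 10s batch interval, each batch reads at most
      // 3 records/sec * 10s = 30 records per partition, while ~50 arrive,
      // so roughly 20 records per partition are left behind every batch
      // and the scheduling delay grows without bound.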