
What is the difference between blocks in spark.streaming.blockInterval and RDD partitions in Spark Streaming?

Quoting Spark Streaming 2.2.0 documentation:

For most receivers, the received data is coalesced together into blocks of data before storing inside Spark’s memory. The number of blocks in each batch determines the number of tasks that will be used to process the received data in a map-like transformation.

The number of blocks is determined by the block interval, and we can also define the number of RDD partitions ourselves. So, as I understand it, they cannot be the same thing. What is the difference between them?
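For example, with a batch interval of 2 seconds and the default block interval of 200 ms, each batch should be cut into 2000 ms / 200 ms = 10 blocks, so the batch RDD would have 10 partitions.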


1 Answer


spark.streaming.blockInterval: the interval at which data received by Spark Streaming receivers is chunked into blocks of data before being stored in Spark. This applies only when using the receiver-based approach - Receiver-based Approach
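Here is a minimal sketch of the receiver-based approach; the app name, host, port, and intervals are placeholder values:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Block interval controls how often the receiver cuts buffered data
// into a new block; 200ms is the default value.
val conf = new SparkConf()
  .setAppName("BlockIntervalDemo")
  .set("spark.streaming.blockInterval", "200ms")

val ssc = new StreamingContext(conf, Seconds(2))

// socketTextStream uses a receiver, so each 2s batch RDD is assembled
// from blocks: roughly 2000ms / 200ms = 10 blocks, i.e. ~10 partitions.
val lines = ssc.socketTextStream("localhost", 9999)
lines.foreachRDD(rdd => println(s"partitions = ${rdd.getNumPartitions}"))

ssc.start()
ssc.awaitTermination()
```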

KafkaUtils.createDirectStream() does not use a receiver, so with the DStream API Spark Streaming will create as many RDD partitions as there are Kafka partitions to consume - Direct Approach (No Receivers)

That means the block interval configuration has no effect when using the direct (receiver-less) approach.
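For comparison, a minimal sketch of the direct approach using the spark-streaming-kafka-0-10 integration; the broker address, topic, and group id are placeholders, and ssc is a StreamingContext as in the sketch above:

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

// Placeholder connection settings; adjust for your cluster.
val kafkaParams = Map[String, Object](
  "bootstrap.servers"  -> "localhost:9092",
  "key.deserializer"   -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id"           -> "demo-group",
  "auto.offset.reset"  -> "latest"
)

val stream = KafkaUtils.createDirectStream[String, String](
  ssc,              // StreamingContext, as in the sketch above
  PreferConsistent,
  Subscribe[String, String](Seq("demo-topic"), kafkaParams)
)

// No receiver, no blocks: each batch RDD gets exactly one partition
// per Kafka partition of "demo-topic"; blockInterval plays no role here.
stream.foreachRDD(rdd => println(s"partitions = ${rdd.getNumPartitions}"))
```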