
Every spark.streaming.blockInterval (say, 1 minute), receivers listen to streaming sources for data. Suppose the current micro-batch is taking an unnaturally long time to complete (intentionally, say 20 minutes). During this micro-batch, would the receivers still listen to the streaming source and store the incoming data in Spark memory?

The current pipeline runs in Azure Databricks using Spark Structured Streaming. Can anyone help me understand this?


1 Answer


In this scenario, Spark will continue to consume data from Kafka, and unprocessed micro-batches will pile up, eventually causing out-of-memory (OOM) errors. To avoid this, enable the backpressure setting:

spark.streaming.backpressure.enabled=true

For more details on Spark's backpressure feature, see:

https://spark.apache.org/docs/latest/streaming-programming-guide.html
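As a minimal sketch, the setting above can be supplied as Spark configuration (for example in spark-defaults.conf, or via --conf on spark-submit). The initial-rate cap shown alongside it is an optional, assumed tuning value, not something from the question:

```
# Let the receiver ingestion rate adapt to how fast batches are actually processed
spark.streaming.backpressure.enabled=true

# Optional: cap the rate (records/sec) for the first batch, before the
# backpressure feedback loop has any processing history to work from
spark.streaming.backpressure.initialRate=1000
```

Note that these spark.streaming.* settings apply to the DStream-based Spark Streaming API; if the pipeline is purely Structured Streaming, rate limiting is typically done on the source instead (e.g. the Kafka source's maxOffsetsPerTrigger option).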