We have a Spark Streaming application that performs a few heavy stateful computations on the incoming stream of data. The state is maintained in external storage (HDFS/Hive/HBase/Cassandra), and at the end of every window the delta change in state is written back using an append-only write strategy.
The issue is that in every window the planning phase takes a long time; in fact, longer than the actual compute time.
dStream.foreachRDD(rdd => {
  val dataset_1 = rdd.toDS()
  val dataset_2 = dataset_1.join(..)
  val dataset_3 = dataset_2
    .map(..)
    .filter(..)
    .join(..)
  // A few more joins & transformations
  val finalDataset = ..
  finalDataset
    .write
    .option("maxRecordsPerFile", 5000)
    .format(save_format)
    .mode("append")
    .insertInto("table_name")
})
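For reference, this is roughly how we confirmed that planning dominates: a sketch (assuming Spark 3.x, where `QueryExecution` exposes a `QueryPlanningTracker`) that registers a `QueryExecutionListener` and logs the per-phase durations (analysis, optimization, planning) for every action, so they can be compared against total execution time.

```scala
import org.apache.spark.sql.execution.QueryExecution
import org.apache.spark.sql.util.QueryExecutionListener

// Sketch: log how long each query-planning phase took for every action.
// Assumes Spark 3.x; `spark` is the active SparkSession.
spark.listenerManager.register(new QueryExecutionListener {
  override def onSuccess(funcName: String,
                         qe: QueryExecution,
                         durationNs: Long): Unit = {
    // tracker.phases maps phase name -> PhaseSummary (start/end timestamps)
    val phases = qe.tracker.phases
      .map { case (name, summary) => s"$name=${summary.durationMs}ms" }
      .mkString(", ")
    println(s"$funcName: total=${durationNs / 1000000}ms, phases: [$phases]")
  }

  override def onFailure(funcName: String,
                         qe: QueryExecution,
                         exception: Exception): Unit = ()
})
```

In our logs, the summed phase durations consistently exceed the time spent actually executing the job.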
Is there a way to reuse the physical plan from the previous window and have Spark skip the planning stages for each new window? Practically nothing changes between windows, so Spark appears to be re-deriving the same plan every time.