Spark Streaming Re-Use Physical Plan

Question

We have a Spark Streaming application, which performs a few heavy stateful computation against the incoming stream of data. Here the state is maintained in some storage (HDFS/Hive/Hbase/Cassandra) and at the end of every window the delta change in state is updated back using Append Only write strategy.

The issue is that for each and every window the planning phase is taking a lot of time; in-fact more than the compute time.

    dStream.foreachRDD(rdd => {
        val dataset_1 = rdd.toDS()
        val dataset_2 = dataset_1.join(..)
        val dataset_3 = dataset_2
         .map(..)
         .filter(..)
         .join(..)

        // A few more Joins & Transformations

        val finalDataset = ..

        finalDataset
         .write
         .option("maxRecordsPerFile", 5000)
         .format(save_format)
         .mode("append")
         .insertInto("table_name")

    })

Is there a way to re-use the Physical Plan from the last window and avoid planning stages every window by Spark, because practically nothing would have changed between windows.

Jacek Laskowski Jacek Laskowski · Accepted Answer · 2020-05-25T15:35:57

I don't think so and that's one of the many reasons why you should be using Spark Structured Streaming instead.

Among the features of the underlying stream engine is to re-use physical plan of a streaming query.

Spark Streaming Re-Use Physical Plan

1 Answers