I am working on a requirement to display a real-time dashboard based on some aggregations computed over the input data.
I have just started exploring Spark/Spark Streaming, and I see that Spark Streaming can compute these aggregations in near real time over micro-batches and feed the results to the UI dashboard.
My question is: if the Spark Streaming job is stopped or crashes at any point after it has started, how will it resume from the position it was last processing when it comes back up? I understand that Spark maintains an internal state that we update for every new piece of data we receive, but wouldn't that state be gone when the job is restarted?
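To make this concrete, here is a minimal sketch (in Scala) of the kind of stateful aggregation I have in mind; the socket source, the 10-second batch interval, and the app name are placeholders for my actual setup:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object DashboardAggregation {
  def main(args: Array[String]): Unit = {
    // local[2]: the socket receiver occupies one core, so at least two are needed
    val conf = new SparkConf().setAppName("DashboardAggregation").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(10))
    // updateStateByKey refuses to run without a checkpoint directory, though
    // I am not sure whether this alone gives me recovery after a restart
    ssc.checkpoint("/tmp/dashboard-checkpoint")

    // Placeholder source; my real input would replace this
    val events = ssc.socketTextStream("localhost", 9999)

    // Running total per key, updated on every micro-batch
    val runningTotals = events
      .map(key => (key, 1L))
      .updateStateByKey[Long] { (newValues: Seq[Long], total: Option[Long]) =>
        Some(total.getOrElse(0L) + newValues.sum)
      }

    runningTotals.print() // in reality this would feed the UI dashboard

    ssc.start()
    ssc.awaitTermination()
  }
}
```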
I suspect we have to periodically persist the running total/result so that Spark can resume processing by fetching it from there when it restarts, but I am not sure how to do that with Spark Streaming.
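From the documentation I gather that checkpointing plus StreamingContext.getOrCreate might be the mechanism for this; something like the sketch below (the checkpoint path is a placeholder, and would presumably need to live on HDFS/S3 to survive a machine failure), but I am not sure whether this is the intended approach:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object RecoverableDashboardAggregation {
  val checkpointDir = "/tmp/dashboard-checkpoint" // placeholder path

  // All DStream setup has to happen inside this function so that a fresh
  // context can be built when no checkpoint exists yet
  def createContext(): StreamingContext = {
    val conf = new SparkConf().setAppName("DashboardAggregation").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(10))
    ssc.checkpoint(checkpointDir) // state and metadata are saved here periodically

    val events = ssc.socketTextStream("localhost", 9999) // placeholder source
    val runningTotals = events
      .map(key => (key, 1L))
      .updateStateByKey[Long] { (newValues: Seq[Long], total: Option[Long]) =>
        Some(total.getOrElse(0L) + newValues.sum)
      }
    runningTotals.print()
    ssc
  }

  def main(args: Array[String]): Unit = {
    // If a checkpoint exists, rebuild the context (including the running state)
    // from it; otherwise create a fresh one via createContext
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()
  }
}
```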
I am also not sure whether Spark Streaming by default guarantees that no data is lost, as I have just started using it.
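While searching I also came across a receiver write-ahead-log setting that appears related to preventing data loss, but I do not know whether enabling it is sufficient, or whether it even applies to my source:

```scala
import org.apache.spark.SparkConf

// Write received data to a write-ahead log in the checkpoint directory so it
// can be replayed after a driver failure (available since Spark 1.2)
val conf = new SparkConf()
  .setAppName("DashboardAggregation")
  .set("spark.streaming.receiver.writeAheadLog.enable", "true")
```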
If anyone has faced a similar scenario, could you please share your thoughts on how I can address this?