I am working on a requirement to display a real-time dashboard based on some aggregations computed over the input data.
I have just started exploring Spark/Spark Streaming, and I see that Spark Streaming can compute these aggregations in near real time over micro-batches and feed the results to the UI dashboard.
My question is: if the Spark Streaming job is stopped or crashes at any point after it has started, how will it resume from the position it was last processing when it comes back up? I understand that Spark maintains an internal state that we update for every new piece of data we receive, but wouldn't that state be gone when the job is restarted?
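To make this concrete, here is a minimal sketch (in Scala) of the kind of stateful aggregation I have in mind; the socket source, the 10-second batch interval, and the app name are placeholders for my actual setup:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object DashboardAggregation {
  def main(args: Array[String]): Unit = {
    // local[2]: the socket receiver occupies one core, so at least two are needed
    val conf = new SparkConf().setAppName("DashboardAggregation").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(10))
    // updateStateByKey refuses to run without a checkpoint directory, though
    // I am not sure whether this alone gives me recovery after a restart
    ssc.checkpoint("/tmp/dashboard-checkpoint")

    // Placeholder source; my real input would replace this
    val events = ssc.socketTextStream("localhost", 9999)

    // Running total per key, updated on every micro-batch
    val runningTotals = events
      .map(key => (key, 1L))
      .updateStateByKey[Long] { (newValues: Seq[Long], total: Option[Long]) =>
        Some(total.getOrElse(0L) + newValues.sum)
      }

    runningTotals.print() // in reality this would feed the UI dashboard

    ssc.start()
    ssc.awaitTermination()
  }
}
```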
I suspect we have to periodically persist the running total/result so that Spark can resume processing by fetching it from there when it restarts, but I am not sure how to do that with Spark Streaming.
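From the documentation I gather that checkpointing plus StreamingContext.getOrCreate might be the mechanism for this; something like the sketch below (the checkpoint path is a placeholder, and would presumably need to live on HDFS/S3 to survive a machine failure), but I am not sure whether this is the intended approach:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object RecoverableDashboardAggregation {
  val checkpointDir = "/tmp/dashboard-checkpoint" // placeholder path

  // All DStream setup has to happen inside this function so that a fresh
  // context can be built when no checkpoint exists yet
  def createContext(): StreamingContext = {
    val conf = new SparkConf().setAppName("DashboardAggregation").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(10))
    ssc.checkpoint(checkpointDir) // state and metadata are saved here periodically

    val events = ssc.socketTextStream("localhost", 9999) // placeholder source
    val runningTotals = events
      .map(key => (key, 1L))
      .updateStateByKey[Long] { (newValues: Seq[Long], total: Option[Long]) =>
        Some(total.getOrElse(0L) + newValues.sum)
      }
    runningTotals.print()
    ssc
  }

  def main(args: Array[String]): Unit = {
    // If a checkpoint exists, rebuild the context (including the running state)
    // from it; otherwise create a fresh one via createContext
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()
  }
}
```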
I am also not sure whether Spark Streaming by default guarantees that no data is lost, as I have just started using it.
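While searching I also came across a receiver write-ahead-log setting that appears related to preventing data loss, but I do not know whether enabling it is sufficient, or whether it even applies to my source:

```scala
import org.apache.spark.SparkConf

// Write received data to a write-ahead log in the checkpoint directory so it
// can be replayed after a driver failure (available since Spark 1.2)
val conf = new SparkConf()
  .setAppName("DashboardAggregation")
  .set("spark.streaming.receiver.writeAheadLog.enable", "true")
```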
If anyone has faced a similar scenario, could you please share your thoughts on how I can address this?