I am new to Spark, so pardon me if this question is too basic. I have a real-time scenario where data is continuously pushed to a queue and an analysis needs to run on it. Spark pulls the data from the queue. The analysis is multi-stage: the RDD is iterated over repeatedly, with intermediate updates from every stage, and we finally get some mappings that are written back into the RDD itself. The analysis needs to be repeated every n minutes, and each run should work on the previous run's final RDD state plus the new data. The jobs always run sequentially; the next job never starts until the previous one has completed.
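To make the flow concrete, here is a minimal sketch of one cycle as I picture it; the pair-RDD element type, `runStage`, and `numStages` are just placeholders, not my actual job:

```scala
import org.apache.spark.rdd.RDD

object AnalysisCycle {
  // Placeholder for one analysis stage; the real job updates the mapping here.
  def runStage(state: RDD[(String, Long)]): RDD[(String, Long)] =
    state.reduceByKey(_ + _)

  // One cycle: previous final state + newly pulled data, run through all stages.
  def runCycle(previousState: RDD[(String, Long)],
               newData: RDD[(String, Long)],
               numStages: Int): RDD[(String, Long)] = {
    var state = previousState.union(newData)
    for (_ <- 1 to numStages)
      state = runStage(state)
    state // the final mapping, which becomes the input of the next cycle
  }
}
```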
I could always write the data from one run to external storage or a cache and repopulate the RDD in the next cycle, but that would introduce unnecessary overhead and hurt performance.
Please suggest the best approach for this scenario. Is caching or persisting the RDD the solution? I am not sure how cache/persist works in Spark: is the cached data local to one node, or available to all nodes? Ideally, every node would retain its own chunk of the data, so the next iteration starts with practically no processing delay.
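For reference, this is the kind of carry-over between cycles I have in mind, assuming persist keeps each executor's partitions in its own local memory (the storage level and names here are illustrative, not something I have verified):

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

object StateCarryOver {
  // Assumed pattern: pin the final state of one run so the next run can reuse it.
  def carryOver(previous: RDD[(String, Long)],
                updated: RDD[(String, Long)]): RDD[(String, Long)] = {
    updated.persist(StorageLevel.MEMORY_AND_DISK) // each executor caches its own partitions
    updated.count()                               // force evaluation so the cache is populated now
    previous.unpersist()                          // release the previous cycle's cached blocks
    updated
  }
}
```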