
I am new to Spark, so pardon me if this question is too basic. I have a real-time scenario where data is continuously pushed to a queue and an analysis needs to run on that data. Spark pulls the data from the queue. The analysis is multi-stage: the RDD is iterated over repeatedly, with intermediate updates from every stage, and at the end we get some mappings which are updated in the RDD itself. The analysis needs to be repeated every n minutes, and it should work on the previous final state of the RDD plus the new data. The jobs always run sequentially; the next job never starts until the previous one has completed.

I can always write the data from one run to external storage or a cache and then repopulate the RDD in the next cycle, but this introduces unnecessary overhead and will hurt performance.

Please suggest the best approach for this scenario. Is caching or persisting the RDD the solution? I am not sure how cache/persist works in Spark. Is the cached data local to one node or available to all nodes? The ideal scenario would be for every node to retain its chunk of the data, so the next iteration starts with practically no delay in processing.


1 Answer


It seems you're working with a plain Spark project, getting the information from a queue and updating it. If so, a better approach could be to use Spark Streaming, which automates this kind of iteration through window operations. Spark Streaming also gives you operations such as updateStateByKey that I think could be useful to you.

In your case, you could define a stream that pulls from your queue, performs some operations over a window, and updates the state when each batch completes. A minimal sketch of that pattern follows.
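Here is a rough sketch, using queueStream as a stand-in for your real queue source (Kafka, Kinesis, etc.) and updateStateByKey to carry state across batches; the app name, checkpoint path, and batch interval are placeholders, not anything from your setup:

```scala
import scala.collection.mutable
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingStateSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("streaming-state").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(10)) // micro-batch interval
    ssc.checkpoint("/tmp/spark-checkpoint")           // required by updateStateByKey

    // Stand-in for your real queue source: queueStream lets you feed
    // RDDs into the stream, which is handy for testing the logic.
    val rddQueue = new mutable.Queue[RDD[(String, Int)]]()
    val stream = ssc.queueStream(rddQueue)

    // Carry state across batches: here, a running count per key.
    val updated = stream.updateStateByKey[Int] {
      (newValues: Seq[Int], state: Option[Int]) =>
        Some(newValues.sum + state.getOrElse(0))
    }

    updated.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```

Each batch, Spark applies your update function to the new values for every key together with the previously stored state, which matches your "previous final state + new data" requirement without reloading anything.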

Hope that helps you!

Edit

Well, keeping it simple... There are two main scenarios where you can use Spark. On the one hand you have batch processes, which work over RDDs. An example could be: "I need a daily summary of how many people buy from a store, by genre." That is what I mean by "plain Spark", the core Spark API (a small sketch of this follows).
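For instance, a batch job of that kind might look like the following; this is a hypothetical sketch assuming a CSV input of customerId,genre lines at a made-up HDFS path:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object DailySummary {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("daily-summary").setMaster("local[2]"))

    // Assumed format: each line is "customerId,genre"
    val purchases = sc.textFile("hdfs:///sales/2016-01-01.csv")

    // Count purchases per genre with the core RDD API.
    val byGenre = purchases
      .map(_.split(","))
      .map(fields => (fields(1), 1))
      .reduceByKey(_ + _)

    byGenre.collect().foreach { case (genre, n) => println(s"$genre: $n") }
    sc.stop()
  }
}
```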

On the other hand, there are cases where the nature of the information, and the way you access it, is continuous. For example: "I want to show, in real time, how many people, by genre, come to my store, displayed as a gauge."

As the documentation puts it: "Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant stream processing of live data streams."

The first thing you must decide is which of these scenarios is yours, so you can choose an approach. Do you need real time?

Persist and cache don't work the way you think. As you know, there are two kinds of operations in Spark: transformations and actions. persist and cache are used when you perform the same operation on an RDD more than once, as is checkpointing in your code. You should read the RDD persistence section of the Spark programming guide.
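To make that concrete, here is a small sketch (the input path is hypothetical). The key point for your question about locality: after the first action materializes a persisted RDD, each executor keeps its own partitions in memory (or on disk), so the cached data stays distributed across the nodes rather than being collected anywhere central:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object CacheSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("cache-demo").setMaster("local[2]"))

    val data = sc.textFile("/data/events.txt") // hypothetical input
      .map(_.split(","))
      .filter(_.length > 1)

    // Mark the RDD for caching; it is materialized by the first action,
    // with each executor holding its own partitions.
    data.persist(StorageLevel.MEMORY_AND_DISK)

    val total = data.count() // triggers computation, populates the cache
    val sample = data.take(5) // served from the cached partitions

    println(s"total=$total, sampled=${sample.length}")
    sc.stop()
  }
}
```

Note, though, that a cached RDD only lives as long as its SparkContext, so caching alone won't carry state between separate job runs.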

You could share information between executions by storing the resulting RDD as a file in HDFS (for example) and loading it as a data source at the beginning of each iteration, as sketched below.
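A rough sketch of that handoff, assuming a hypothetical HDFS path (note that saveAsObjectFile fails if the path already exists, so a real job would version or clean up the path between runs):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object StateHandoffSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("state-handoff").setMaster("local[2]"))
    val statePath = "hdfs:///jobs/analysis/state" // hypothetical location

    // End of run N: write the final RDD so the next run can pick it up.
    val result = sc.parallelize(Seq(("a", 1), ("b", 2))) // stand-in for your final mapping
    result.saveAsObjectFile(statePath)

    // Start of run N+1: reload the previous state and merge with new data.
    val previous = sc.objectFile[(String, Int)](statePath)
    val newData = sc.parallelize(Seq(("a", 3)))
    val merged = previous.union(newData).reduceByKey(_ + _)
    merged.collect().foreach(println)

    sc.stop()
  }
}
```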

Hope it helps you.