Is Spark Streaming good for processing randomly timed data?

Question

I'm evaluating Apache Spark and its Spark Streaming component as the backend of a web app and an external API. With such a setup, the backend will receive a lot of data that is not uniformly timed, e.g. in response to user actions on the website. Edit: In most cases, I'll want a real-time response for the user. The faster the better, but a latency of 1-3 seconds is OK. A sample use case is the user requesting to see a statistical graph in the web UI.

From the documentation, I understand that the fundamental model of Spark Streaming is data processed at equal intervals (the "batch duration"), e.g. of 1 second or so.
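To make the fixed-interval model concrete, here is a minimal sketch in plain Java. It does not use Spark at all. It assumes batches close at exact multiples of the batch duration, and the class and method names (`MicroBatchDelay`, `batchCloseTime`, `waitMs`) are hypothetical, chosen only for illustration. It shows how long a randomly timed event would sit in a buffer before its batch is even handed to the processing code.

```java
// Toy arithmetic for a fixed-interval batching model; not Spark's API.
class MicroBatchDelay {

    // Time (ms) at which the batch containing an event that arrived at
    // arrivalMs is closed, assuming batches close at multiples of batchMs.
    static long batchCloseTime(long arrivalMs, long batchMs) {
        return (arrivalMs / batchMs + 1) * batchMs;
    }

    // How long the event waits in the buffer before processing can start.
    static long waitMs(long arrivalMs, long batchMs) {
        return batchCloseTime(arrivalMs, batchMs) - arrivalMs;
    }

    public static void main(String[] args) {
        long batchMs = 1000;                      // assumed 1-second batch duration
        long[] arrivals = {120, 950, 1000, 4300}; // made-up, irregular arrival times
        for (long t : arrivals) {
            System.out.println("arrival=" + t + " ms, batch closes at "
                    + batchCloseTime(t, batchMs) + " ms, waits "
                    + waitMs(t, batchMs) + " ms");
        }
    }
}
```

Under these assumptions, an event arriving at 120 ms waits 880 ms, and the wait never exceeds one batch duration. Any actual processing and delivery time comes on top of that wait.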

The batching that Spark Streaming performs could perhaps be useful for reducing the number of RPCs from my frontend to my backend: from one per event to one per second.

But besides that, this architecture seems to have flaws:

Processing events at fixed intervals, and not as they occur.

The slightly unusual control flow in Spark Streaming programs:

doSparkStreaming() {
  <define processing on data; unclear how it gets called later>
  streamingContext.start();
  streamingContext.awaitTermination();
}

Unnecessary polling of data sources during long times of user inactivity.
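The control flow in the second point can be pictured with a toy model, again in plain Java and not Spark's real API. Everything here (`ToyStreamingContext`, `foreachBatch`, `receive`, `tick`) is a hypothetical stand-in. The idea it sketches is that "defining the processing" only registers a callback, and a timer later invokes that callback once per batch interval, including intervals in which nothing arrived.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Toy stand-in for a streaming context; all names are hypothetical.
class ToyStreamingContext {
    private final List<Consumer<List<String>>> handlers = new ArrayList<>();
    private final List<String> buffer = new ArrayList<>();

    // "Define processing": only stores the callback, nothing runs yet.
    void foreachBatch(Consumer<List<String>> handler) {
        handlers.add(handler);
    }

    // Stands in for a receiver appending an incoming event to the buffer.
    void receive(String event) {
        buffer.add(event);
    }

    // Stands in for one tick of the batch timer: hand the buffered events
    // to every registered handler, then start a fresh batch.
    void tick() {
        List<String> batch = new ArrayList<>(buffer);
        buffer.clear();
        for (Consumer<List<String>> handler : handlers) {
            handler.accept(batch);
        }
    }

    public static void main(String[] args) {
        ToyStreamingContext ctx = new ToyStreamingContext();
        ctx.foreachBatch(batch -> System.out.println("processing batch: " + batch));
        ctx.receive("click");
        ctx.receive("view");
        ctx.tick(); // the handler runs here, with both buffered events
        ctx.tick(); // idle interval: the handler still runs, with an empty batch
    }
}
```

In this toy, `tick()` is called by hand. In a real streaming framework the equivalent of `start()` would presumably launch the timer that drives it, and `awaitTermination()` would simply keep the calling thread alive while that happens.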

I'm leaning towards NOT using it. Am I perhaps missing something?

Many thanks.

What use cases are you considering? It all depends on the use case. — maasg
There is not enough info here to make a decision. All I can tell from this is that your use case involves lots of data that is not uniformly timed. I'm not even sure streaming is the right model for this. I know there is already an answer, but I think you could get a much better one if you gave more details on your use case. — aaronman
@aaronman, maasg: Thanks - I added latency requirements (1-3 sec) and use case (interactive web UI) in first paragraph. What do you think? — Jan Żankowski
Do I understand correctly that you want to use spark streaming (or similar) to receive a request for a webpage and deliver data? And what has been called 'not uniformly timed' are those user requests? What is the data being requested? Where is it stored? — maasg
@maasg: Yes, that's right. We realize this is not a standard architecture, and we're only considering Spark Streaming because we also have many use cases for periodic computation without any latency requirements. Plus, Spark has a machine learning library and integration with distributed DBs, which we'll definitely need one way or another. The overall system is for business process monitoring/analytics. Do you think, based on the requirements stated in my question, that we should look for something else? — Jan Żankowski

WestCoastProjects · Accepted Answer · 2014-07-21T12:57:44

Your analysis is on target: Apache Spark in its current form is likely not the best choice. I would instead recommend a queuing system, e.g. Kafka, or possibly a Cassandra/Storm-based approach.
