I'm evaluating Apache Spark and its Spark Streaming component for the backend of a web app and an external API. With such a setup, the backend will receive a lot of data that is not uniformly timed, e.g. data arriving in response to user actions on the website. Edit: In most cases, I'll want a real-time response for the user. The faster the better, but a latency of 1-3 seconds is acceptable. A sample use case is the user requesting to see a statistical graph in the web UI.
From the documentation, I understand that the fundamental model of Spark Streaming is that data is processed at fixed, equal intervals (the "batch duration"), e.g. every 1 second.
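To make sure I understand the model correctly, here is a toy illustration of micro-batching (plain Python, not Spark itself; the function name and event format are my own for illustration): events arriving at arbitrary times are grouped into fixed-width intervals, and each interval is processed as one batch.

```python
# Toy illustration of the micro-batch model (NOT Spark itself):
# irregularly-timed events are bucketed into fixed-width intervals.
from collections import defaultdict

def micro_batches(events, batch_duration):
    """Group (timestamp, payload) events into fixed-width batches."""
    batches = defaultdict(list)
    for ts, payload in events:
        batches[int(ts // batch_duration)].append(payload)
    return dict(batches)

# A burst of user actions at irregular times (in seconds).
events = [(0.1, "a"), (0.4, "b"), (2.7, "c"), (2.9, "d")]
print(micro_batches(events, batch_duration=1.0))
# {0: ['a', 'b'], 2: ['c', 'd']}
```

(One difference from real Spark Streaming: in this toy, an interval with no events simply doesn't appear, whereas my understanding is that Spark schedules a batch every interval regardless, which relates to my polling concern below.)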
The batching that Spark Streaming performs could perhaps be useful to reduce the number of RPCs from my frontend to my backend: from one per event to one per second.
But besides that, this architecture seems to have flaws for my use case:
- Processing events at fixed intervals, rather than as they occur.
- The slightly unusual control flow in Spark Streaming programs:

  ```
  doSparkStreaming() {
      <define processing on data; unclear how it gets called later>
      streamingContext.start();
      streamingContext.awaitTermination();
  }
  ```

- Unnecessary polling of data sources during long periods of user inactivity.
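On the second point, my current understanding of the inversion of control, sketched outside Spark in plain Python (class and method names are made up for illustration): the processing you define before `start()` is only *registered*, and `start()` hands control to a scheduler that applies it to each interval's batch.

```python
# A toy sketch (NOT Spark's API) of the define-then-start control flow:
# transformations are registered up front and run nothing; start() runs
# a scheduler loop that invokes them once per batch interval.
class ToyStreamingContext:
    def __init__(self):
        self._actions = []

    def for_each_batch(self, fn):
        # Like defining DStream processing: fn is stored, not called.
        self._actions.append(fn)

    def start(self, batches):
        # The "scheduler": for every batch interval, apply the
        # registered processing to that interval's data.
        for batch in batches:
            for fn in self._actions:
                fn(batch)

ctx = ToyStreamingContext()
results = []
ctx.for_each_batch(lambda batch: results.append(sum(batch)))

# Three batch intervals' worth of data; the middle interval is empty,
# yet it is still processed (producing 0) -- every interval fires.
ctx.start([[1, 2], [], [3]])
print(results)  # [3, 0, 3]
```

If that reading is right, it explains both the unusual control flow and the polling during inactivity: the scheduler runs every interval whether or not data arrived.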
I'm leaning towards NOT using it. Am I perhaps missing something?
Many thanks.