Using Apache Spark to serve real time web services queries

Question

We have a use case, where we are downloading large volumes (order of 100 gigabytes per day) of data from hundreds of data sources, massaging and processing this data and then exposing this data to our customers via RESTful API. Today the base data size is ca. 20TB and expected to grow heavily in the future.

For the massaging/processing part, we believe spark can be a very good choice for us. Now for exposing processed/massaged data through an API, one option is to store processed data to a read only database like ElephantDB and make web services to talk to ElephantDB (at least this is how Nathan has proposed in his Big Data book). I was just wondering what would be the implication of we make web services implementation to use SparkSQL to access processed data from Spark. What could be the architecture/design dangers in this case?

Every body is talking about Spark is fast and what not and using SparkSQL for interactive queries. But is it already in a stage to serve large volume of web services queries via SparkSQL where we have very strict SLA for latency serve hundreds and thousands of web services requests per second? If Apache Spark could handle this, we could avoid maintaining yet another system like ElephantDB or Cassandra or what not.

Would like to hear from the experts on this board.

Marius Soutier Marius Soutier · Accepted Answer · 2015-06-04T20:52:40

If the results are stored in files, you have no indexes, and SparkSQL also doesn't create indexes. The only thing that can be somewhat fast is reading columns from Parquet files and caching tables.

But in general it's not a good use case to use SparkSQL to serve web requests simply because Spark wasn't made for that.

Using Apache Spark to serve real time web services queries

2 Answers