4 votes

We are trying to find a way to load a trained Spark (2.x) ML model so that, on request (through a REST interface), we can query it and get predictions, e.g. http://predictor.com:8080/give/me/predictions?a=1,b=2,c=3

There are out-of-the-box libraries for loading a model into Spark (given it was stored somewhere after training using MLWritable) and then using it for predictions, but wrapping this in a job and running it per request/call seems like overkill because of the cost of initializing a SparkContext.

However, using Spark has the advantage that we can save our Pipeline model and perform the same feature transformations without having to reimplement them outside of the SparkContext.
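For illustration, here is a minimal sketch of what loading a saved PipelineModel and scoring a single request looks like in Spark 2.x; the model path and the column names a, b, c are placeholders taken from the example URL above:

```scala
import org.apache.spark.ml.PipelineModel
import org.apache.spark.sql.SparkSession

// Placeholder: assumes the fitted pipeline was saved with model.save(...)
val spark = SparkSession.builder()
  .appName("prediction-service")
  .master("local[*]") // placeholder; a real service would keep a long-lived session
  .getOrCreate()
import spark.implicits._

val model = PipelineModel.load("/models/my-pipeline")

// Turn the request parameters (a=1, b=2, c=3) into a single-row DataFrame;
// model.transform applies the same feature transformations used in training.
val request = Seq((1.0, 2.0, 3.0)).toDF("a", "b", "c")
val prediction = model.transform(request)
  .select("prediction")
  .first()
  .getDouble(0)
```

Keeping the SparkSession and the loaded model alive between requests is exactly what a plain per-request spark-submit job cannot give us, which is what leads to the job-server idea below.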

After some digging, we found that spark-job-server can potentially help with this by keeping a "hot" SparkContext initialized for the job server, so we can serve requests by calling the prediction job (and getting the results back) within the existing context through spark-job-server's REST API.
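As a rough, hedged sketch (against spark-jobserver's classic SparkJob API; the model path and the parameter names a, b, c are placeholders), the per-request prediction job could look like this:

```scala
import com.typesafe.config.Config
import org.apache.spark.SparkContext
import org.apache.spark.ml.PipelineModel
import org.apache.spark.sql.SparkSession
import spark.jobserver.{SparkJob, SparkJobValid, SparkJobValidation}

object PredictionJob extends SparkJob {

  // Loaded once per JVM/context and reused across requests,
  // so only the first call pays the model-loading cost.
  private lazy val model = PipelineModel.load("/models/my-pipeline")

  override def validate(sc: SparkContext, config: Config): SparkJobValidation =
    SparkJobValid

  override def runJob(sc: SparkContext, config: Config): Any = {
    val spark = SparkSession.builder().config(sc.getConf).getOrCreate()
    import spark.implicits._
    val df = Seq((config.getDouble("a"), config.getDouble("b"), config.getDouble("c")))
      .toDF("a", "b", "c")
    model.transform(df).select("prediction").first().getDouble(0)
  }
}
```

Each request would then become a synchronous POST to the job server's /jobs endpoint against the pre-created ("hot") context, with a, b and c passed in the job config.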

Is this the best approach to API-ify predictions? Because of the size of the feature space, we cannot pre-compute predictions for all combinations.

Alternatively, we were thinking about using Spark Streaming and persisting the predictions to a message queue. This would let us avoid spark-job-server, but it doesn't simplify the overall flow. Has anyone tried a similar approach?
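For reference, a minimal sketch of that streaming alternative using Structured Streaming with Kafka as the message queue (the broker address, topic names, model path and CSV event format are all assumptions, and the pipeline's stages must support streaming DataFrames):

```scala
import org.apache.spark.ml.PipelineModel
import org.apache.spark.sql.SparkSession

// Requires the spark-sql-kafka-0-10 package on the classpath.
val spark = SparkSession.builder().appName("streaming-predictions").getOrCreate()
val model = PipelineModel.load("/models/my-pipeline")

// Assume each Kafka message is a CSV line "a,b,c".
val events = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "events")
  .load()
  .selectExpr("CAST(value AS STRING) AS csv")
  .selectExpr(
    "CAST(split(csv, ',')[0] AS DOUBLE) AS a",
    "CAST(split(csv, ',')[1] AS DOUBLE) AS b",
    "CAST(split(csv, ',')[2] AS DOUBLE) AS c")

// Score every event and persist the predictions to another topic.
model.transform(events)
  .selectExpr("CAST(prediction AS STRING) AS value")
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("topic", "predictions")
  .option("checkpointLocation", "/tmp/checkpoints/predictions")
  .start()
  .awaitTermination()
```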

2
We've recently tried to use jobserver to solve a similar problem of executing Spark jobs on demand. Although it is nice, it is far from being a production-grade, ready-to-ship product. You have to do a lot of tweaking manually, support for Spark 2.x is in preview, and deploying it takes work. If you're ready to put in a substantial amount of work, go ahead. We ended up going with a solution based on Spark's undocumented REST API. – Yuval Itzchakov
Would it even respond in decent time (sub-0.1 sec)? In my experience, ML pipelines are really slow due to the various steps in their computation, like converting schemas, type checks, and most importantly some kind of model/matrix broadcast, at least for NaiveBayes, W2V and some others I have used. (The cost is amortized when you have tons of predictions to make, but the setup is prohibitive in the single-prediction case.) Either way, I don't see Spark ML pipelines performing anywhere near sub-second. Have you achieved otherwise? – GPI

2 Answers

1 vote

Another option could be Cloudera's Livy (http://livy.io/ | https://github.com/cloudera/livy#rest-api), which allows for session caching, interactive queries, batch jobs and more. I've used it and found it very promising.
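For the record, a rough sketch of how a service could talk to Livy's session API (the host, session id and statement code are placeholders, and `model` is assumed to have been loaded by an earlier statement in the same session):

```scala
import java.net.{HttpURLConnection, URL}
import java.nio.charset.StandardCharsets

// Minimal JSON POST helper using only the JDK.
def post(url: String, json: String): String = {
  val conn = new URL(url).openConnection().asInstanceOf[HttpURLConnection]
  conn.setRequestMethod("POST")
  conn.setRequestProperty("Content-Type", "application/json")
  conn.setDoOutput(true)
  conn.getOutputStream.write(json.getBytes(StandardCharsets.UTF_8))
  scala.io.Source.fromInputStream(conn.getInputStream).mkString
}

// 1. Create a cached Spark session once, at service startup.
post("http://livy-host:8998/sessions", """{"kind": "spark"}""")

// 2. Per request: run a prediction statement inside the existing session.
//    The session id (0) would come from the create-session response.
post("http://livy-host:8998/sessions/0/statements",
  """{"code": "model.transform(Seq((1.0, 2.0, 3.0)).toDF(\"a\", \"b\", \"c\")).select(\"prediction\").collect()"}""")
```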

1 vote

Scenarios

  • Prediction using Spark Streaming - supports real-time scoring/prediction, but requires a stream-based flow (push model).

    • Pipeline: score/predict on the stream -> store results in a real-time store -> serve the data via REST, or plug an analytical toolkit on top for dashboards
    • Pros: real-time scoring; real-time dashboarding capability when used with stores that support stream writes (e.g. Druid)
    • Cons: scores all events; storage bloats, so a data-archival strategy is needed to keep dashboarding lightweight
  • REST-based predictions (PredictionIO, Spark JobServer) - supports interaction-level scoring (request-response model).

    • Pipeline: deploy the trained model in a SparkContext -> predict and return scores in response to REST requests
    • Pros: supports interactive scoring and selective event scoring; makes a web app's interactions intelligent
    • Cons: less performant than stream-based scoring; higher memory footprint; an additional framework is required alongside Apache Spark

The Answer

It depends on the use case. If you have a stream-based dataflow and need to score all events, use a Spark Streaming based pipeline; a real-life example would be scoring incoming financial transactions for fraud detection. If your requirement is interaction-based, go with REST-based scoring; for example, recommending similar items/products to a user based on that user's interactions on the website/app.