
I've deployed a linear model for classification on Google Machine Learning Engine and want to predict new data using online prediction.

When I called the API using the Google API client library, it took around 0.5s to get the response for a request with only one instance. I expected the latency to be less than 10 microseconds (because the model is quite simple), and 0.5s was way too long. I also tried scoring the new data offline using the predict_proba method: it took 8.2s to score more than 100,000 instances, which is much faster per instance than using Google ML Engine. Is there a way I can reduce the latency of online prediction? The model and the server that sends the requests are hosted in the same region.

I want to make predictions in real time (the response is returned immediately after the API receives the request). Is Google ML Engine suitable for this purpose?
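For context, a minimal sketch of the kind of call being described, with the Google API client library. The project, model, and feature names below are placeholders, not the ones actually used here:

```python
# Sketch of a single-instance online prediction call, timed end to end.
# "my-project", "my-model", and the feature names are placeholders.
import time

from googleapiclient import discovery

service = discovery.build('ml', 'v1')
name = 'projects/my-project/models/my-model'  # placeholder project/model

body = {'instances': [{'feature_1': 0.5, 'feature_2': 1.2}]}  # placeholder features

start = time.time()
response = service.projects().predict(name=name, body=body).execute()
print('round-trip latency: %.3fs' % (time.time() - start))
print(response['predictions'])
```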

In addition to Lak's answer below, it would help to deploy your model with --enable-logging; the logs contain some per-request latency information (accessible in Stackdriver Logging), and it would be useful to see them. – rhaertel80
We'd love to help you identify the sources of latency. Do you mind sending your project/model/version to cloudml-feedback@? Side note: you can increase throughput by including many instances per request (the current limit is a 1.5 MB payload size). – rhaertel80
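To illustrate the batching suggestion in the comment above, a request body can carry many instances in its "instances" list, which raises throughput per call. A hedged sketch (placeholder project, model, and feature names; the payload limit mentioned above still applies):

```python
# Sketch of batching many rows into one online prediction request.
# Names are placeholders; the per-request payload size limit still applies.
from googleapiclient import discovery

service = discovery.build('ml', 'v1')
name = 'projects/my-project/models/my-model'  # placeholder project/model

body = {
    'instances': [
        {'feature_1': 0.5, 'feature_2': 1.2},
        {'feature_1': 0.1, 'feature_2': 3.4},
        # ... more rows, up to the payload size limit
    ]
}
response = service.projects().predict(name=name, body=body).execute()
print(len(response['predictions']), 'predictions returned')
```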

1 Answer


Some more info would be helpful:

  1. Can you measure the network latency from the machine you are calling the service from to GCP? Latency will be lowest if you call from a Compute Engine instance in the same region you deployed the model to.

  2. Can you post your calling code?

  3. Is this the latency of the first request or of every request? (See the sketch after this list for one way to check.)
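One way to separate first-request overhead from steady-state latency is to time several consecutive calls from a Compute Engine VM in the same region as the model. A rough sketch, not the asker's actual code, with placeholder names:

```python
# Time several consecutive online prediction calls to compare the first
# request (connection setup, possible cold start) with later ones.
# Project, model, and feature names are placeholders.
import time

from googleapiclient import discovery

service = discovery.build('ml', 'v1')
name = 'projects/my-project/models/my-model'  # placeholder project/model

body = {'instances': [{'feature_1': 0.5, 'feature_2': 1.2}]}  # placeholder features

for i in range(5):
    start = time.time()
    service.projects().predict(name=name, body=body).execute()
    print('request %d: %.3fs' % (i, time.time() - start))
```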

To answer your final question: yes, Cloud ML Engine is designed to support a high rate of queries per second.