2 votes

I have created a Cloud ML Engine model and tried to generate online/HTTP predictions, but am finding that the latency of running a prediction is still quite high. Below is the Python script I am using to generate predictions (from here):

import googleapiclient.discovery


def predict_json(project, model, instances, version=None):
    # Build the Cloud ML Engine (ml v1) API client.
    service = googleapiclient.discovery.build('ml', 'v1')
    name = 'projects/{}/models/{}'.format(project, model)

    # Target a specific model version; otherwise the default version is used.
    if version is not None:
        name += '/versions/{}'.format(version)

    # Send the instances to the online prediction endpoint.
    response = service.projects().predict(
        name=name,
        body={'instances': instances}
    ).execute()

    if 'error' in response:
        raise RuntimeError(response['error'])

    return response['predictions']
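
For reference, a call to this function looks roughly like the following; the project ID, model name, and input instance here are placeholders rather than my real values:

# Placeholder values for illustration only.
predictions = predict_json(
    project='my-gcp-project',                  # placeholder GCP project ID
    model='my_model',                          # placeholder model name
    instances=[{'input': [0.1, 0.2, 0.3]}],    # shape depends on the serving signature
    version='v1'                               # optional; omit to use the default version
)
print(predictions)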

When the model is run on my laptop, once I have a tf.Session with the graph and all variables restored, a forward pass through the network takes around 0.16s (for a batch size of 1). However, when I feed in the same data using Cloud ML, a forward pass takes around 3.6s, even when I run the same script multiple times.
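
For illustration, here is a minimal sketch of this kind of local measurement, assuming a TF1-style checkpoint restore (the checkpoint path and the input/output tensor names are hypothetical, not my actual ones):

import time

import numpy as np
import tensorflow as tf

# Hypothetical checkpoint path and tensor names, for illustration only.
CHECKPOINT = './model/model.ckpt'

with tf.Session() as sess:
    # Restore the graph and all variables once, up front.
    saver = tf.train.import_meta_graph(CHECKPOINT + '.meta')
    saver.restore(sess, CHECKPOINT)
    graph = tf.get_default_graph()
    x = graph.get_tensor_by_name('input:0')
    y = graph.get_tensor_by_name('output:0')

    # Time a single forward pass for a batch size of 1.
    batch = np.random.rand(1, 10).astype(np.float32)
    start = time.time()
    sess.run(y, feed_dict={x: batch})
    print('forward pass took {:.3f}s'.format(time.time() - start))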

I suspect that the model is being reloaded from scratch every time I make a prediction. Is there a way to keep the same tf.Session running in the background so that predictions are generated much faster? Or is there something else I am doing incorrectly?

Thanks in advance for your help!

According to this doc, you can use the default version of the model or you can specify a different version every time. Are you using the same version every time? Check this doc about Managing Models and Jobs. Maybe this general troubleshooting doc is helpful as well: check the way Cloud resources are provisioned for predictions. – Tudormi
Thanks for your reply. The model I am using only has one version (which I have set as the default), so that should not be the problem; none of the issues in the troubleshooting doc are applicable either. – yuji

1 Answer

0 votes
  1. Measure the latency between your computer and Google Cloud. Try sending a malformed URL and measuring the response time.

  2. Check the region the service was deployed in.

  3. Send five requests to the service, 30 seconds apart. Does the latency go down? (A timing sketch follows this list.)
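
As a rough way to check point 3, something like the following could time a handful of requests spaced 30 seconds apart; it assumes the predict_json function from the question is available, and the project, model, and instance values are placeholders:

import time

# Placeholder values for illustration; substitute your own project, model and input.
PROJECT = 'my-gcp-project'
MODEL = 'my_model'
INSTANCES = [{'input': [0.1, 0.2, 0.3]}]

for i in range(5):
    start = time.time()
    predict_json(PROJECT, MODEL, INSTANCES)
    print('request {}: {:.2f}s'.format(i + 1, time.time() - start))
    if i < 4:
        time.sleep(30)  # wait 30 seconds between requests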