2 votes

I have created a Cloud ML Engine model and tried to generate online/HTTP predictions, but am finding that the latency of running a prediction is still quite high. Below is the Python script I am using to generate predictions (from here):

import googleapiclient.discovery


def predict_json(project, model, instances, version=None):
    # Build the Cloud ML Engine (ml v1) API client.
    service = googleapiclient.discovery.build('ml', 'v1')
    name = 'projects/{}/models/{}'.format(project, model)

    # Target a specific model version; otherwise the default version is used.
    if version is not None:
        name += '/versions/{}'.format(version)

    # Send the instances to the online prediction endpoint.
    response = service.projects().predict(
        name=name,
        body={'instances': instances}
    ).execute()

    if 'error' in response:
        raise RuntimeError(response['error'])

    return response['predictions']
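
For reference, a call to this function looks roughly like the following; the project ID, model name, and input instance here are placeholders rather than my real values:

# Placeholder values for illustration only.
predictions = predict_json(
    project='my-gcp-project',                  # placeholder GCP project ID
    model='my_model',                          # placeholder model name
    instances=[{'input': [0.1, 0.2, 0.3]}],    # shape depends on the serving signature
    version='v1'                               # optional; omit to use the default version
)
print(predictions)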

When the model is run on my laptop, once I have a tf.Session with the graph and all variables restored, a forward pass through the network takes around 0.16s (for a batch size of 1). However, when I feed in the same data using Cloud ML, a forward pass takes around 3.6s, even when I run the same script multiple times.
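
For illustration, here is a minimal sketch of this kind of local measurement, assuming a TF1-style checkpoint restore (the checkpoint path and the input/output tensor names are hypothetical, not my actual ones):

import time

import numpy as np
import tensorflow as tf

# Hypothetical checkpoint path and tensor names, for illustration only.
CHECKPOINT = './model/model.ckpt'

with tf.Session() as sess:
    # Restore the graph and all variables once, up front.
    saver = tf.train.import_meta_graph(CHECKPOINT + '.meta')
    saver.restore(sess, CHECKPOINT)
    graph = tf.get_default_graph()
    x = graph.get_tensor_by_name('input:0')
    y = graph.get_tensor_by_name('output:0')

    # Time a single forward pass for a batch size of 1.
    batch = np.random.rand(1, 10).astype(np.float32)
    start = time.time()
    sess.run(y, feed_dict={x: batch})
    print('forward pass took {:.3f}s'.format(time.time() - start))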

I suspect that the model is being reloaded from scratch every time I make a prediction. Is there a way to keep the same tf.Session running in the background so that predictions are generated much faster? Or is there something else I am doing incorrectly?

Thanks in advance for your help!

According to this doc, you can use the default version of the model or you can specify a different version every time. Are you using the same version every time? Check this doc about Managing Models and Jobs. Maybe this general troubleshooting doc is helpful as well: check the way Cloud resources are provisioned for predictions. – Tudormi
Thanks for your reply. The model I am using only has one version (which I have set as the default), so that should not be the problem; none of the issues in the troubleshooting doc are applicable either. – yuji

1 Answer

0 votes
  1. Measure the latency between your computer and Google Cloud. Try sending a malformed URL and measuring the response time.

  2. Check the region the service was deployed in.

  3. Send five requests to the service, 30 seconds apart. Does the latency go down? (A timing sketch follows this list.)
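
As a rough way to check point 3, something like the following could time a handful of requests spaced 30 seconds apart; it assumes the predict_json function from the question is available, and the project, model, and instance values are placeholders:

import time

# Placeholder values for illustration; substitute your own project, model and input.
PROJECT = 'my-gcp-project'
MODEL = 'my_model'
INSTANCES = [{'input': [0.1, 0.2, 0.3]}]

for i in range(5):
    start = time.time()
    predict_json(PROJECT, MODEL, INSTANCES)
    print('request {}: {:.2f}s'.format(i + 1, time.time() - start))
    if i < 4:
        time.sleep(30)  # wait 30 seconds between requests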