I have created a Cloud ML Engine model and tried to generate online/HTTP predictions, but am finding that the latency of running a prediction is still quite high. Below is the Python script I am using to generate predictions (from here):
import googleapiclient.discovery

def predict_json(project, model, instances, version=None):
    """Send an online prediction request to a deployed Cloud ML Engine model."""
    # Build a client for the Cloud ML Engine ('ml', v1) API.
    service = googleapiclient.discovery.build('ml', 'v1')
    name = 'projects/{}/models/{}'.format(project, model)
    if version is not None:
        name += '/versions/{}'.format(version)
    # Issue the prediction request with the instances as the request body.
    response = service.projects().predict(
        name=name,
        body={'instances': instances}
    ).execute()
    if 'error' in response:
        raise RuntimeError(response['error'])
    return response['predictions']
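For reference, a typical call looks something like the following (the project name, model name, and instance values are placeholders, not my real setup):

    # Hypothetical call -- 'my-project' and 'my-model' stand in for real names.
    features = [0.1] * 10  # one instance; the shape depends on the model's inputs
    predictions = predict_json('my-project', 'my-model', instances=[features])
    print(predictions)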
When I run the model on my laptop, once I have a tf.Session with the graph and all variables restored, a forward pass through the network takes around 0.16s (for a batch size of 1). However, when I feed the same data through Cloud ML, a forward pass takes around 3.6s, even when I run the same script multiple times.
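For context, this is roughly how I time the local forward pass (a minimal sketch; the export directory and the tensor names 'input:0' / 'output:0' are placeholders for my actual SavedModel):

    import time
    import numpy as np
    import tensorflow as tf

    # Load the SavedModel once, then time a single forward pass.
    with tf.Session(graph=tf.Graph()) as sess:
        tf.saved_model.loader.load(sess, ['serve'], 'export_dir')
        x = np.random.rand(1, 10).astype(np.float32)  # batch size of 1

        start = time.time()
        sess.run('output:0', feed_dict={'input:0': x})
        print('forward pass took {:.2f}s'.format(time.time() - start))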
I suspect that the model is being reloaded from scratch every time I attempt to make a prediction. Is there a way to keep the same tf.Session running in the background so that predictions are generated much faster? Or is there something else I am doing incorrectly?
Thanks in advance for your help!