1
votes

I'm trying to train a Keras model on Google Cloud ML Engine. I've followed every instruction in this repository (https://github.com/clintonreece/keras-cloud-ml-engine). When I try to run it locally, I get ImportErrors for scikit-learn, and when I try to run it on the cloud, the job fails. I don't think the setup.py file is getting executed. Here are the contents of the setup.py file:

'''Cloud ML Engine package configuration.'''
from setuptools import setup, find_packages

REQUIRED_PACKAGES = ['keras',
                     'pandas',
                     'sklearn',
                     'numpy',
                     'h5py']

setup(name='iris_classifier',
      version='1.0',
      packages=find_packages(),
      include_package_data=True,
      description='IRIS classifier keras model on Cloud ML Engine',
      author='Loonycorn',
      author_email='[email protected]',
      license='MIT',
      install_requires=[REQUIRED_PACKAGES],
      zip_safe=False)

Why are my packages not getting installed?

Here's the command for training:

gcloud ml-engine jobs submit training $JOB_NAME \
    --job-dir $JOB_DIR \
    --runtime-version 1.0 \
    --module-name trainer.iris_classifier \
    --package-path ./trainer \
    --region $REGION \
    -- \
    --train-file gs://$BUCKET_NAME/data/iris.csv

setup.py resides in the root directory, alongside the data folder (which contains the CSV) and the trainer folder (which contains the iris_classifier.py and __init__.py files).

Here's the error when the job fails:

{
 insertId:  "2sbguefffpjr1"  
 logName:  "projects/loonycorn-kerasdeployment/logs/ml.googleapis.com%2Firis_classifier_train_220180511_200703"  
 receiveTimestamp:  "2018-05-11T14:38:18.976968299Z"  
 resource: {…}  
 severity:  "ERROR"  
 textPayload:  "The replica master 0 exited with a non-zero status of 1."  
 timestamp:  "2018-05-11T14:38:18.976968299Z"  
}

I've given the Logs Writer permission to the Cloud ML service account, but this is still the only log output I get.

1
Where does the setup.py reside? What is the error message when the job fails? (please update your post with that info). Also, note that when running locally, setup.py is not run -- the local run depends on packages installed on your system. - rhaertel80
I've made the edits. - Judy T Raj

1 Answer

3
votes

I used the gcloud ml-engine jobs stream-logs $JOB_NAME command to stream the logs. It turned out that setup.py was being executed, but because I hadn't specified versions for the dependencies, the version of scikit-learn that got installed didn't have the model_selection module. Once I pinned the versions of all the dependencies in setup.py, everything worked. I recommend explicitly setting each dependency's version to the one installed on your local machine, so you know everything your code uses is supported and included.
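
One way to get the pinned list is to read the versions off the local machine and paste them into REQUIRED_PACKAGES. The following is only a rough sketch, assuming Python 3.8+ (for importlib.metadata) and assuming the PyPI distribution name scikit-learn rather than sklearn. pin_requirements is a hypothetical helper name for illustration, not part of setuptools or gcloud.

```python
from importlib.metadata import PackageNotFoundError, version


def pin_requirements(names, lookup=version):
    """Return 'name==version' strings for locally installed packages.

    `lookup` defaults to importlib.metadata.version; it is a parameter
    only so the helper can be exercised without the packages installed.
    Packages that are not installed locally are returned unpinned.
    """
    pinned = []
    for name in names:
        try:
            pinned.append(f"{name}=={lookup(name)}")
        except PackageNotFoundError:
            # Not installed locally: leave the requirement unpinned.
            pinned.append(name)
    return pinned


if __name__ == "__main__":
    # Distribution names as assumed above; adjust to match your project.
    for requirement in pin_requirements(
        ["keras", "pandas", "scikit-learn", "numpy", "h5py"]
    ):
        print(requirement)
```

The printed lines can be copied into REQUIRED_PACKAGES in setup.py. The versions should still be checked against what the chosen --runtime-version supports, since a version that works locally may not install cleanly there.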