Error when Submitting Job Training in Google Cloud ML

Question

I'm trying to submit a training job to Google Cloud ML with Facenet (a TensorFlow library for face recognition). Specifically, I'm trying to run the part of the library (link is here) that trains the model.

For Google Cloud ML, I'm following this tutorial (link is here), which explains how to submit a training job.

I was able to submit the training job to Google Cloud ML, but it failed with an error. Here are some pictures of the errors:

And here's an error from the Google Cloud job logs:

Here are more detailed pictures of the Google Cloud job logs:

The job request was submitted successfully, and the job even got as far as waiting for TensorFlow to start, but the error appeared right after that.

Here is the command I used:

gcloud ml-engine jobs submit training facetraining_test4 \
--package-path=/Users/myname/Documents/projects/tf-projects/facenet/src/ \
--module-name=/Users/myname/Documents/projects/tf-projects/facenet/src/facenet_train_classifier.py \
--staging-bucket=gs://facenet-training-test \
--region=asia-east1 \
--config=/Users/myname/Documents/projects/tf-projects/facenet/none_config.yml  \
-- \
--logs_base_dir=/Users/myname/Documents/projects/tf-projects/logs/facenet/ \
--models_base_dir=/Users/myname/Documents/projects/tf-projects/models/facenet/ \
--data_dir=/Users/myname/Documents/projects/tf-projects/facenet_datasets/employee_dataset/employee/employee_maxpy_mtcnnpy_182/ \
--image_size=160 \
--model_def=models.inception_resnet_v1 \
--lfw_dir=/Users/myname/Documents/projects/tf-projects/facenet_datasets/lfw/lfw_mtcnnpy_160/ \
--optimizer=RMSPROP \
--learning_rate -1 \
--max_nrof_epochs=80 \
--keep_probability=0.8 \
--learning_rate_schedule_file=/Users/myname/Documents/projects/tf-projects/facenet/data/learning_rate_schedule_classifier_casia.txt \
--weight_decay=5e-5 \
--center_loss_factor=1e-4

Any suggestions on how to fix this? Thanks!

If you go to console.cloud.google.com/mlengine/jobs you'll see a list of jobs and a link to the logs. Look for errors there and report back what you find. — rhaertel80
Are you sure you are not returning a non-zero status code at the end of your training script? If not, I'd simply add some logging statements to your .py file and then check the logs in console.cloud.google.com/mlengine/jobs for your job to see where it is crashing. — Amir Hormati
@rhaertel80 I've added more pictures for a more detailed look at the errors. — Mikebarson
@AmirHormati I'm quite new to the Google Cloud ML. I'm having a hard time trying to understand the errors. I've added pictures for a more detailed look at the errors. If you could help me understand the errors that would be great! — Mikebarson
@Mikebarson Do you have the refactored version that can run on Cloud ML? — wiput1999

Jeremy Lewi · Accepted Answer · 2017-03-09T15:31:47

When you run on Cloud ML Engine, your code executes in a remote environment, so file paths from your local machine will not exist there. If you need to import Python modules, include them in the Python package you build and then import them using the package name.
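
As a rough pre-submission check, you could scan the arguments you pass after `--` for values that look like local paths. This is only a sketch: the helper name `find_local_paths` is hypothetical, and it assumes your data and outputs are meant to live in Cloud Storage under `gs://` URIs (the `logs` suffix on the bucket below is made up for illustration).

```python
def find_local_paths(args):
    """Return (flag, value) pairs whose value looks like a local filesystem path."""
    suspicious = []
    for arg in args:
        # Only inspect arguments written as --flag=value.
        if not arg.startswith("--") or "=" not in arg:
            continue
        flag, value = arg.split("=", 1)
        # Absolute, home-relative, or dot-relative paths exist only on your machine.
        if value.startswith(("/", "~", "./", "../")):
            suspicious.append((flag, value))
    return suspicious


user_args = [
    "--data_dir=/Users/myname/Documents/projects/tf-projects/facenet_datasets/lfw/lfw_mtcnnpy_160/",
    "--image_size=160",
    "--model_def=models.inception_resnet_v1",
    "--logs_base_dir=gs://facenet-training-test/logs",  # hypothetical location
]

for flag, value in find_local_paths(user_args):
    print(f"{flag} points at a local path: {value}")
```

Anything this flags would need to be uploaded to a bucket (or bundled into the package) before the remote workers can see it.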

For docs on how to build packages, please refer to the setuptools docs.

Here's the 30-second version.

Organize your code as follows:


    my_package/__init__.py
    my_package/moduleA.py
    my_package/moduleB.py
    my_package/...
    setup.py

For your setup.py file, start with this:

    from setuptools import find_packages
    from setuptools import setup

    # Third-party dependencies that pip should install alongside this package.
    REQUIRED_PACKAGES = []

    setup(
        name='my_package',
        version='0.1.1',
        author='Author',
        author_email='[email protected]',
        install_requires=REQUIRED_PACKAGES,
        # Picks up every directory that contains an __init__.py.
        packages=find_packages(),
        description='Description',
        requires=[],
    )

Build the package as follows:

    python ./setup.py sdist
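
Before uploading, it can help to confirm that your modules actually ended up in the source distribution. A minimal sketch using only the standard library; the helper name `list_python_files` is hypothetical, and the archive path assumes the `name` and `version` from the setup.py above and the default `dist/` output directory.

```python
import os
import tarfile


def list_python_files(sdist_path):
    """Return the sorted names of all .py files inside a .tar.gz sdist."""
    with tarfile.open(sdist_path, "r:gz") as archive:
        return sorted(name for name in archive.getnames() if name.endswith(".py"))


# Assumed output location of the sdist command; adjust to your build.
sdist = "dist/my_package-0.1.1.tar.gz"
if os.path.exists(sdist):
    for name in list_python_files(sdist):
        print(name)
```

If a module you import is missing from the listing, check that its directory contains an `__init__.py` so that `find_packages()` can see it.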

Once the package is installed in Cloud ML Engine, you will be able to import your code as:

    from my_package import moduleA
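
The same naming applies to the `--module-name` flag: it expects a dotted module name such as `my_package.moduleA` rather than a path to a `.py` file, which is what the command in the question passes. A small sketch of the conversion, assuming the script lives inside the package directory; the helper name `to_module_name` and the `/work` directory are hypothetical.

```python
from pathlib import PurePosixPath


def to_module_name(package_parent, script_path):
    """Turn a script path into the dotted name that `python -m` would accept."""
    # Path of the script relative to the directory that contains the package.
    relative = PurePosixPath(script_path).relative_to(PurePosixPath(package_parent))
    # Drop the .py suffix and join the remaining path components with dots.
    return ".".join(relative.with_suffix("").parts)


print(to_module_name("/work", "/work/my_package/moduleA.py"))  # my_package.moduleA
```

With this layout you would pass the resulting name to `--module-name` and point `--package-path` at the `my_package/` directory.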
