2
votes

I'm trying to submit a training job to Google Cloud ML using Facenet (a TensorFlow library for face recognition). Specifically, I'm running this part of the library (link is here), which trains the model.

On the Google Cloud ML side, I'm following this tutorial (link is here), which shows how to submit a training job.

I was able to submit a training job to Google Cloud ML, but it failed with an error. Here are some pictures of the errors:

[Picture of the error]

And here's the error from the Google Cloud Jobs logs:

[Picture of the error log in Google Cloud Jobs]

Here are more detailed pictures of the Google Cloud Job logs:

[More detailed pictures of the Google Cloud Job logs (1)]

[More detailed pictures of the Google Cloud Job logs (2)]

The job submission itself succeeded, and the job even got as far as waiting for TensorFlow to start, but the error appeared right after that.

Here is the command I used:

gcloud ml-engine jobs submit training facetraining_test4 \
--package-path=/Users/myname/Documents/projects/tf-projects/facenet/src/ \
--module-name=/Users/myname/Documents/projects/tf-projects/facenet/src/facenet_train_classifier.py \
--staging-bucket=gs://facenet-training-test \
--region=asia-east1 \
--config=/Users/myname/Documents/projects/tf-projects/facenet/none_config.yml  \
-- \
--logs_base_dir=/Users/myname/Documents/projects/tf-projects/logs/facenet/ \
--models_base_dir=/Users/myname/Documents/projects/tf-projects/models/facenet/ \
--data_dir=/Users/myname/Documents/projects/tf-projects/facenet_datasets/employee_dataset/employee/employee_maxpy_mtcnnpy_182/ \
--image_size=160 \
--model_def=models.inception_resnet_v1 \
--lfw_dir=/Users/myname/Documents/projects/tf-projects/facenet_datasets/lfw/lfw_mtcnnpy_160/ \
--optimizer=RMSPROP \
--learning_rate -1  \
--max_nrof_epochs=80 \
--keep_probability=0.8 \
--learning_rate_schedule_file=/Users/myname/Documents/projects/tf-projects/facenet/data/learning_rate_schedule_classifier_casia.txt \
--weight_decay=5e-5  \
--center_loss_factor=1e-4  \

Any suggestions on how to fix this? Thanks!

If you go to console.cloud.google.com/mlengine/jobs you'll see a list of jobs and a link to the logs. Look for errors there and report back what you find. - rhaertel80
Are you sure you are not returning a non-zero status code at the end of your training script? If you are sure, I'd simply add some logging statements to your .py file and then check the logs for your job in console.cloud.google.com/mlengine/jobs to see where it is crashing. - Amir Hormati
@rhaertel80 I've added more pictures for a more detailed look at the errors. - Mikebarson
@AmirHormati I'm quite new to Google Cloud ML and I'm having a hard time understanding the errors. I've added pictures for a more detailed look at them. If you could help me understand the errors, that would be great! - Mikebarson
@Mikebarson Do you have the refactored version that can run on Cloud ML? - wiput1999
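
The logging suggestion in the comments above could look roughly like the following. This is only a sketch: `run_with_logging` is a hypothetical wrapper, not something Facenet or Cloud ML provides, and it assumes the training script exposes its entry point as a plain callable.

```python
import logging


def run_with_logging(main):
    """Run `main`, log start/finish/crash, and return a process exit code.

    Hypothetical helper: `main` stands in for the training script's entry point.
    """
    logging.basicConfig(level=logging.INFO)
    logging.info("training script started")
    try:
        main()
    except Exception:
        # logging.exception records the full traceback in the job logs.
        logging.exception("training script crashed")
        return 1  # a non-zero status tells the service the job failed
    logging.info("training script finished")
    return 0
```

At the bottom of the script you would then call something like `sys.exit(run_with_logging(main))`, so the exit status and the last log line agree about what happened.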

2 Answers

1
votes

When you run on Cloud ML Engine, your code runs in a remote environment, so file paths will not be the same as on your local machine. If you need to import Python modules, include them in the Python package you build and import them by package name.
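
As far as I know, `--module-name` expects a dotted module name (such as `trainer.task`) rather than a path to a `.py` file. The sketch below shows one way to derive that dotted name from a local file path. `path_to_module_name` is a hypothetical helper written for illustration, and it assumes the package directory contains an `__init__.py`.

```python
import os


def path_to_module_name(file_path, package_dir):
    """Convert a .py file path into a dotted module name.

    Hypothetical helper: the name is built relative to the parent of
    `package_dir`, so the package's own directory name is the first part.
    """
    parent = os.path.dirname(os.path.normpath(package_dir))
    relative = os.path.relpath(file_path, parent)
    stem, ext = os.path.splitext(relative)
    if ext != ".py":
        raise ValueError("expected a .py file, got: " + file_path)
    return ".".join(stem.split(os.sep))


# Paths shaped like the ones in the question (shortened for readability).
print(path_to_module_name(
    "/Users/myname/facenet/src/facenet_train_classifier.py",
    "/Users/myname/facenet/src/",
))
```

Under those assumptions, the value to pass would be the dotted name, not the absolute path used in the question.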

For details on how to build packages, please refer to the setuptools docs.

Here's the 30-second version:

  1. Organize your code as follows:

         my_package/__init__.py
         my_package/moduleA.py
         my_package/moduleB.py
         my_package/...
         setup.py

  2. For your setup.py file, start with this:

         from setuptools import find_packages
         from setuptools import setup

         REQUIRED_PACKAGES = []

         setup(
             name='my_package',
             version='0.1.1',
             author='Author',
             author_email='[email protected]',
             install_requires=REQUIRED_PACKAGES,
             packages=find_packages(),
             description='Description',
             requires=[],
         )

  3. Build a package as follows:

         python ./setup.py sdist

  4. When the package is installed in Cloud ML Engine, you will be able to import your code as:

         from my_package import moduleA
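
To check the layout locally before submitting, the steps above can be sketched as follows. This is an illustration only: it creates a throwaway `my_package` in a temporary directory, and adding that directory to `sys.path` merely stands in for installing the built package. The `greet` function is made up for the demo.

```python
import importlib
import os
import sys
import tempfile

# Build the layout from step 1 in a temporary directory.
root = tempfile.mkdtemp()
package_dir = os.path.join(root, "my_package")
os.makedirs(package_dir)

# An empty __init__.py marks the directory as a package.
open(os.path.join(package_dir, "__init__.py"), "w").close()

# A tiny moduleA with a made-up function, just so there is something to call.
with open(os.path.join(package_dir, "moduleA.py"), "w") as f:
    f.write("def greet():\n    return 'hello from moduleA'\n")

# Stand-in for installation: make the parent directory importable.
sys.path.insert(0, root)
importlib.invalidate_caches()

# Import by package name, not by file path.
module_a = importlib.import_module("my_package.moduleA")
print(module_a.greet())
```

If the import by package name works locally, the same `from my_package import moduleA` statement should be usable once the package is installed remotely.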
0
votes

Looking at the error message above, it seems that your issue is related to Python's "ImportError: import by filename is not supported" error. Without seeing your Python source code, I can't tell you exactly how to fix it, but the following link should help you solve it:

Python / ImportError: Import by filename is not supported

In general, look for places where you are importing using file paths, and make sure you are using the import functions correctly.
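
A small sketch of the difference, using only the standard library. `can_import` is a hypothetical helper and the file path is made up. The exact error text depends on the Python version (the message quoted above comes from Python 2), so the sketch only checks for `ImportError`.

```python
import importlib


def can_import(name):
    """Return True if `name` can be imported as a module, False otherwise."""
    try:
        importlib.import_module(name)
    except ImportError:
        # Covers import-by-filename failures as well as missing modules.
        return False
    return True


# A dotted module name is what the import machinery expects.
print(can_import("os.path"))

# A file path (hypothetical here) is not a valid module name.
print(can_import("/tmp/hypothetical/trainer/task.py"))
```

Any place in your code or job configuration that passes a path like the second one where a module name is expected is a likely source of the error.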