Can't train my Tensorflow detector model in Google Cloud

Question

I'm trying to train my own Detector model based on Tensorflow sample and this post. And I did succeed on training locally on a Macbook Pro. The problem is that I don't have a GPU and doing it on the CPU is too slow (about 25s per iteration).

This way, I'm trying to run on Google Cloud ML Engine following the tutorial, but I can't make it run properly.

My folder structures is described below:

+ data
 - train.record
 - test.record
+ models
 + train
 + eval
+ training
 - ssd_mobilenet_v1_coco

My steps to change from local training to Google Cloud training were:

Create a bucket in Google Cloud storage and copy my local folder structure with files;
Edit my pipeline.config file and change all paths from Users/dev/detector/ to gcc://bucketname/;
Create a YAML file with the default configuration provided in the tutorial;
Run

gcloud ml-engine jobs submit training object_detection_date +%s \ --job-dir=gs://bucketname/models/train \ --packages dist/object_detection-0.1.tar.gz,slim/dist/slim-0.1.tar.gz \ --module-name object_detection.train \ --region us-east1 \ --config /Users/dev/detector/training/cloud.yml \ -- \ --train_dir=gs://bucketname/models/train \ --pipeline_config_path=gs://bucketname/data/pipeline.config

Doing so, gives me the following error message from the MLUnits:

The replica ps 0 exited with a non-zero status of 1. Termination reason: Error. Traceback (most recent call last): File "/usr/lib/python2.7/runpy.py", line 162, in _run_module_as_main "__main__", fname, loader, pkg_name) File "/usr/lib/python2.7/runpy.py", line 72, in _run_code exec code in run_globals File "/root/.local/lib/python2.7/site-packages/object_detection/train.py", line 49, in from object_detection import trainer File "/root/.local/lib/python2.7/site-packages/object_detection/trainer.py", line 27, in from object_detection.builders import preprocessor_builder File "/root/.local/lib/python2.7/site-packages/object_detection/builders/preprocessor_builder.py", line 21, in from object_detection.protos import preprocessor_pb2 File "/root/.local/lib/python2.7/site-packages/object_detection/protos/preprocessor_pb2.py", line 71, in options=None, file=DESCRIPTOR), TypeError: __new__() got an unexpected keyword argument 'file'

Thanks in advance.

Got this suggestion on Github. Still didn't have time to test it. — Minoru

eranh eranh · Accepted Answer · 2017-12-28T14:04:29

Check the solution posted here by andersskog. It worked for me. I made a patch here. For manual fix follow the below instructions:

Make sure your yaml version is 1.4, eg:

trainingInput:
  runtimeVersion: "1.4"
  scaleTier: CUSTOM
  masterType: standard_gpu
  workerCount: 5
  workerType: standard_gpu
  parameterServerCount: 3
  parameterServerType: standard

Change setup.py to the following:

"""Setup script for object_detection."""

import logging
import subprocess
from setuptools import find_packages
from setuptools import setup
from setuptools.command.install import install

class CustomCommands(install):

    def RunCustomCommand(self, command_list):
        p = subprocess.Popen(
        command_list,
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT)
        stdout_data, _ = p.communicate()
        logging.info('Log command output: %s', stdout_data)
        if p.returncode != 0:
            raise RuntimeError('Command %s failed: exit code: %s' %
                         (command_list, p.returncode))

    def run(self):
        self.RunCustomCommand(['apt-get', 'update'])
        self.RunCustomCommand(
          ['apt-get', 'install', '-y', 'python-tk'])
        install.run(self)

REQUIRED_PACKAGES = ['Pillow>=1.0', 'protobuf>=3.3.0', 'Matplotlib>=2.1']

setup(
    name='object_detection',
    version='0.1',
    install_requires=REQUIRED_PACKAGES,
    include_package_data=True,
    packages=[p for p in find_packages() if p.startswith('object_detection')],
    description='Tensorflow Object Detection Library',
 cmdclass={
        'install': CustomCommands,
    }
)

In object_detection/utils/visualization_utils.py, line 24 (before import matplotlib.pyplot as plt) add:

import matplotlib
matplotlib.use('agg')

In line 184 of object_detection/evaluator.py, change

tf.train.get_or_create_global_step()

to

tf.contrib.framework.get_or_create_global_step()

Finally, in line 103 of object_detection/builders/optimizer_builder.py, change

tf.train.get_or_create_global_step()

to

tf.contrib.framework.get_or_create_global_step()

Hope this helps!

Can't train my Tensorflow detector model in Google Cloud

2 Answers