3
votes

I'm using open source Tensorflow implementations of research papers, for example DCGAN-tensorflow. Most of the libraries I'm using are configured to train the model locally, but I want to use Google Cloud ML to train the model since I don't have a GPU on my laptop. I'm finding it difficult to change the code to support GCS buckets. At the moment, I'm saving my logs and models to /tmp and then running a 'gsutil' command to copy the directory to gs://my-bucket at the end of training (example here). If I try saving the model directly to gs://my-bucket it never shows up.

As for training data, one of the tensorflow samples copies data from GCS to /tmp for training (example here), but this only works when the dataset is small. I want to use celebA, and it is too large to copy to /tmp every run. Is there any documentation or guides for how to go about updating code that trains locally to use Google Cloud ML?

The implementations are running various versions of Tensorflow, mainly .11 and .12

1

1 Answers

11
votes

There is currently no definitive guide. The basic idea would be to replace all occurrences of native Python file operations with equivalents in the file_io module, most notably:

These functions will work locally and on GCS (as well as any registered file system). Note, however, that there are some slight differences in file_io and the standard file operations (e.g., a different set of 'modes' are supported).

Fortunately, checkpoint and summary writing do work out of the box, just be sure to pass a GCS path to tf.train.Saver.save and tf.summary.FileWriter.

In the sample you sent, that looks potentially painful. Consider monkey patching the Python functions to map to the TensorFlow equivalents when the program starts to only have to do it once (demonstrated here).

As a side note, all of the samples on this page show reading files from GCS.