I can see that the TensorBoard process is running and that files are being written to the model directory. However, I repeatedly get "Exception: Cannot start TensorBoard." I am using tf.estimator.
I am running my code on Google Cloud Datalab. I have tried changing the model directory and restarting the Datalab instance many times. I have also tried killing all running TensorBoard processes. Nothing has worked so far. It worked earlier, and it still starts successfully about once in every 10-15 attempts. What is happening?
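For reference, here is a minimal sketch of the kind of cleanup I mean by killing all running TensorBoard processes. It assumes that tb.list() returns a table with a pid column (as in the output further down) and that tb.stop(pid) is available in google.datalab.ml. The helper name stop_all_tensorboards is mine and hypothetical, not part of the library.

```python
def stop_all_tensorboards(tb):
    """Stop every TensorBoard process reported by tb.list().

    `tb` is expected to be google.datalab.ml.TensorBoard (assumption: it
    exposes list() and stop(pid)). Returns the list of pids that were stopped.
    """
    running = tb.list()
    # When nothing is running, the returned table may have no "pid" column.
    if "pid" not in running:
        return []
    pids = [int(pid) for pid in running["pid"]]
    for pid in pids:
        tb.stop(pid)
    return pids


# Usage on Datalab (not executed here):
#   from google.datalab.ml import TensorBoard as tb
#   stop_all_tensorboards(tb)
```

Passing tb in as an argument keeps the helper independent of the Datalab import, so it can be tried with a stand-in object outside Datalab.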
This is how I am starting TensorBoard.
from google.datalab.ml import TensorBoard as tb
tb.start(model_dir)
This is how my Estimator is configured.
run_config = tf.estimator.RunConfig(
    save_checkpoints_steps=FLAGS.save_checkpoints_steps,
    tf_random_seed=FLAGS.tf_random_seed,
    model_dir=model_dir
)
estimator = tf.estimator.Estimator(model_fn=model_fn,
                                   config=run_config)
Below are the files being written into the model directory by tf.estimator.
eval 8 minutes ago
checkpoint 124 B 9 minutes ago
events.out.tfevents.1559025239.78fe4cbf0fad 603 kB 9 minutes ago
graph.pbtxt 399 kB 12 minutes ago
model.ckpt-1.data-00000-of-00001 261 MB 11 minutes ago
model.ckpt-1.index 811 B 11 minutes ago
model.ckpt-1.meta 170 kB 11 minutes ago
model.ckpt-5.data-00000-of-00001 261 MB 9 minutes ago
model.ckpt-5.index 811 B 9 minutes ago
model.ckpt-5.meta 170 kB 9 minutes ago
The error I am getting is below. It is the same every time, and I have no further information to identify what is going wrong.
Exception Traceback (most recent call last) in ()
      2 #tensorboard --logdir ./logs/1/train --host localhost --port 8081
      3 from google.datalab.ml import TensorBoard as tb
----> 4 tb.start(model_dir)

/usr/local/envs/py3env/lib/python3.5/site-packages/google/datalab/ml/_tensorboard.py in start(logdir)
     77 retry -= 1
     78
---> 79 raise Exception('Cannot start TensorBoard.')
     80
     81 @staticmethod

Exception: Cannot start TensorBoard.
When I list the running TensorBoard processes with the code below, this is what I get.
x = tb.list() #Returns a dataframe
print(x)
logdir pid port
0 ./model_no_reuse/2 6236 40269
1 ./model_no_reuse/2 6241 57895
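Since the listing shows two processes for the same logdir, one further check would be whether those ports actually accept connections. Below is a minimal sketch using only the Python standard library. The helper port_is_open is hypothetical (not part of google.datalab), the host 127.0.0.1 is an assumption about where TensorBoard binds, and the port numbers are simply the ones from the listing above.

```python
import socket


def port_is_open(port, host="127.0.0.1", timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False


# Ports taken from the tb.list() output above; host is an assumption.
for port in (40269, 57895):
    print(port, "accepting connections:", port_is_open(port))
```

This only tells me whether something is listening on the port, not whether it is a healthy TensorBoard server, but it would distinguish a process that never came up from one that is running and simply not detected.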
Please help me identify what is going wrong.