
I can see that the TensorBoard process is running, and files are being written to the model directory. However, I repeatedly get the exception "Cannot start TensorBoard." I am using tf.estimator.

I am running my code on Google Cloud Datalab. I have tried changing the model directory, restarting the Datalab instance many times, and killing all running TensorBoard processes. Nothing has worked so far. It was working earlier, and now it starts successfully only about once in every 10-15 attempts. What is happening?

This is how I am starting TensorBoard.

from google.datalab.ml import TensorBoard as tb
tb.start(model_dir)

This is how my Estimator is configured.

run_config = tf.estimator.RunConfig(
  save_checkpoints_steps=FLAGS.save_checkpoints_steps,
  tf_random_seed=FLAGS.tf_random_seed,
  model_dir=model_dir
)

estimator = tf.estimator.Estimator(model_fn=model_fn, config=run_config)

Below are the files being written into the model directory by tf.estimator.

eval                                                    8 minutes ago
checkpoint                                    124 B     9 minutes ago
events.out.tfevents.1559025239.78fe4cbf0fad   603 kB    9 minutes ago
graph.pbtxt                                   399 kB    12 minutes ago
model.ckpt-1.data-00000-of-00001              261 MB    11 minutes ago
model.ckpt-1.index                            811 B     11 minutes ago
model.ckpt-1.meta                             170 kB    11 minutes ago
model.ckpt-5.data-00000-of-00001              261 MB    9 minutes ago
model.ckpt-5.index                            811 B     9 minutes ago
model.ckpt-5.meta                             170 kB    9 minutes ago

The error I am getting is below. It is the same every time, and it gives me no further information about what is going wrong.

Exception Traceback (most recent call last) in ()
      2 #tensorboard --logdir ./logs/1/train --host localhost --port 8081
      3 from google.datalab.ml import TensorBoard as tb
----> 4 tb.start(model_dir)

/usr/local/envs/py3env/lib/python3.5/site-packages/google/datalab/ml/_tensorboard.py in start(logdir)
     77 retry -= 1
     78
---> 79 raise Exception('Cannot start TensorBoard.')
     80
     81 @staticmethod

Exception: Cannot start TensorBoard.
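The `retry -= 1` line just before the `raise` suggests that `start` gives up after a limited number of attempts to reach TensorBoard. That reading is an assumption based only on the traceback, not on the full library source. Under that assumption, a slow-starting TensorBoard on a small VM might still come up after the exception is raised. A minimal sketch for checking this, using only the standard library (the helper name `wait_for_port` and its defaults are hypothetical):

```python
import socket
import time


def wait_for_port(port, host="localhost", timeout_s=60.0, poll_s=1.0):
    """Return True once host:port accepts a TCP connection, False on timeout."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=poll_s):
                return True
        except OSError:
            # Not listening yet (or refused); wait and try again.
            time.sleep(poll_s)
    return False
```

For example, `wait_for_port(40269)` could be called with a port reported by `tb.list()` to see whether the process eventually starts listening.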

When I list the running TensorBoard processes with the code below, this is what I get.

x = tb.list()  # Returns a DataFrame
print(x)

               logdir   pid   port
0  ./model_no_reuse/2  6236  40269
1  ./model_no_reuse/2  6241  57895
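The listing shows two TensorBoard processes registered for the same logdir, so each failed `start` call appears to leave a process behind. A minimal sketch for clearing them before retrying is below. It assumes `tb` is `google.datalab.ml.TensorBoard`, that `tb.list()` returns a table with `logdir` and `pid` columns as printed above, and that `tb.stop(pid)` is available. The helper name `stop_stale_tensorboards` is hypothetical.

```python
def stop_stale_tensorboards(tb, keep_logdir=None):
    """Stop every TensorBoard reported by tb.list(); return the stopped pids.

    `tb` is passed in (rather than imported here) so the helper can be tried
    with a stand-in object outside Datalab. Entries whose logdir equals
    `keep_logdir` are left running.
    """
    running = tb.list()
    if len(running) == 0:
        # Nothing is registered, so there is nothing to stop.
        return []
    stopped = []
    for logdir, pid in zip(list(running["logdir"]), list(running["pid"])):
        if keep_logdir is not None and logdir == keep_logdir:
            continue
        tb.stop(int(pid))
        stopped.append(int(pid))
    return stopped
```

In a Datalab notebook this might be used as `stop_stale_tensorboards(tb)` followed by a fresh `tb.start(model_dir)`.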

Please help me identify what is going wrong.

Some additional information, based on a hunch: does TensorBoard require a minimum amount of CPU or memory to launch? I am using a small VM for development, with 2 vCPUs and 4.5 GB of memory (Intel Skylake). - Anand

1 Answer


I increased the VM configuration from 2 vCPUs/4.5 GB to 4 vCPUs/20 GB and the issue was resolved. It looks like, even though the TensorBoard process does start, it needs a certain minimum amount of resources before it can open. I will update this answer if I reach a different conclusion.
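To check how much headroom a VM has before launching TensorBoard, something like the following sketch could be run in the notebook. It assumes a Linux VM that exposes `/proc/meminfo`. The helper names are hypothetical, and no specific minimum is implied: the only data points here are that 2 vCPUs/4.5 GB failed and 4 vCPUs/20 GB worked.

```python
import os


def parse_meminfo_kb(text, field="MemAvailable"):
    """Return the value in kB of `field` from /proc/meminfo-style text, or None."""
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        if name.strip() == field:
            # Lines look like "MemAvailable:   1048576 kB".
            return int(rest.split()[0])
    return None


def resource_summary(meminfo_path="/proc/meminfo"):
    """Report the vCPU count and available memory in GiB (Linux only)."""
    with open(meminfo_path) as f:
        available_kb = parse_meminfo_kb(f.read())
    available_gib = None if available_kb is None else available_kb / (1024.0 ** 2)
    return {"vcpus": os.cpu_count(), "mem_available_gib": available_gib}
```

Calling `resource_summary()` while the Estimator is training would show whether most of the memory is already in use when `tb.start(model_dir)` is attempted.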