TRANSIENT_ERROR for TPU in Google Colab

Question

I'm trying to run an LRCN Keras model on TPUs with TensorFlow 2.0. The model and generator work on CPU/GPU; I have included them for reference. I also initialize the TPU, it is visible, and everything looks fine until I call .fit():

def frame_generator(self, batch_size, train_test, data_type):
    """Return a generator that we can use to train on. There are
    a couple different things we can return:
    data_type: 'features', 'images'
    """
    # Get the right dataset for the generator.
    train, test = self.split_train_test()
    data = train if train_test == 'train' else test

    #print("Creating %s generator with %d samples." % (train_test, len(data)))

    while 1:
        X, y = [], []

        # Generate batch_size samples.
        for _ in range(batch_size):
            if random.random() < .5:
                # real
                while True:
                    # Get a random sample.
                    sample = random.choice(data)

                    # Get the sequence from disk.
                    (_x,_y) = self.get_extracted_sequence(data_type, sample)

                    if _y==[0,1]:
                        break
            else:
                # fake
                while True:
                    # Get a random sample.
                    sample = random.choice(data)

                    # Get the sequence from disk.
                    (_x,_y) = self.get_extracted_sequence(data_type, sample)

                    if _y==[1,0]:
                        break

            if _x is None:
                raise ValueError("Can't find sequence. Did you generate them?", sample)

            X.append(_x)
            y.append(_y)

        #yield [np.array(X), np.array(y)], np.array(y)
        yield np.array(X), np.array(y)

train_generator = data.frame_generator(batch_size, 'train', 'images')
val_generator = data.frame_generator(batch_size, 'test', 'images')

optimizer = Adam(lr=1e-5)

with tpu_strategy.scope():
  model = lrcn()
  model.add(tf.keras.layers.Dense(2, activation='softmax'))

  model.compile(loss='binary_crossentropy',
      optimizer=optimizer,
      metrics=['accuracy', tf.compat.v1.losses.log_loss])
  model.summary() 

train_data = tf.data.Dataset.from_generator(lambda:next(train_generator),
                                        (tf.float32, tf.int64),
                                        ([4, 32,299,299,3], [4,2])     
                                      )

val_data = tf.data.Dataset.from_generator(lambda:next(val_generator),
                                        (tf.float32, tf.int64),
                                      ([4, 32,299,299,3], [4,2]) 
                                      )


model.fit(x=train_data, steps_per_epoch=train_steps, validation_steps=test_steps,
      validation_data=val_data,
        epochs=30,
        callbacks=callbacks,
        verbose=1)

On model.fit I get:

Train on 6421.0 steps, validate on 1605.0 steps

Epoch 1/30

UnavailableError Traceback (most recent call last) in () 15 epochs=30, 16 callbacks=callbacks, ---> 17 verbose=1)

11 frames /usr/local/lib/python3.6/dist-packages/six.py in raise_from(value, from_value)

UnavailableError: channel is in state TRANSIENT_FAILURE Additional GRPC error information: {"created":"@1584561754.347859160","description":"channel is in state TRANSIENT_FAILURE","file":"external/grpc/src/core/ext/filters/client_channel/client_channel.cc","file_line":2294,"grpc_status":14} [Op:__inference_distributed_function_24182 channel is in state TRANSIENT_FAILURE","file":"external/grpc/src/core/ext/filters/client_channel/client_channel.cc","file_line":2294,"grpc_status":14} [Op:__inference_distributed_function_10577]

Any ideas on how to fix this? It looks like the problem is on Google's network end.

UPDATE:

Part of the solution: do not install TensorFlow 2.1 with pip in the Colab notebook. Instead, run the following in its own cell before "import tensorflow":

%tensorflow_version 2.x

This changes the TPU version from 1.15 to >=2.1.

Now when I run the notebook I get more details:

Train for 6902.0 steps, validate for 1725.0 steps Epoch 1/30

1/6902 [..............................] - ETA: 20:04:55

NotFoundError Traceback (most recent call last) /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/training_v2.py in on_epoch(self, epoch, mode) 766 try: --> 767 yield epoch_logs 768 finally:

18 frames NotFoundError: {{function_node __inference_distributed_function_20824}} No registered 'PyFunc' OpKernel for 'CPU' devices compatible with node {{node PyFunc}} . Registered:

 [[PyFunc]]
 [[MultiDeviceIteratorGetNextFromShard]]
 [[RemoteCall]]
 [[IteratorGetNextAsOptional]]

During handling of the above exception, another exception occurred:

KeyError Traceback (most recent call last) /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/callbacks.py in _get_file_path(self, epoch, logs) 1053 if not self.model._in_multi_worker_mode( 1054 ) or multi_worker_util.should_save_checkpoint(): -> 1055 return self.filepath.format(epoch=epoch + 1, **logs) 1056 else: 1057 # If this is multi-worker training, and this worker should not

KeyError: 'val_accuracy'

Seth Kitchen · Accepted Answer · 2020-03-18T23:43:54

TL;DR

You need to install a newer build that will execute the Python function before sending it to the TPU. Load a newer build via:

import requests
import os
url = 'http://' + os.environ['COLAB_TPU_ADDR'].split(':')[0] + ':8475/requestversion/2.2.0-dev20200311'
resp = requests.post(url)
print(resp)
%pip install tf-nightly==2.2.0-dev20200311
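As a small sketch, the URL construction in the snippet above can be pulled into a helper so the address handling is easy to check on its own. The function name tpu_version_url and the example address are hypothetical; only port 8475, the /requestversion/ path, and the version string come from the snippet.

```python
def tpu_version_url(tpu_addr, version, port=8475):
    """Build the version-request URL for the TPU host.

    tpu_addr is expected in 'host:port' form, as in COLAB_TPU_ADDR.
    Only the host part is kept; the request goes to a different port.
    """
    host = tpu_addr.split(':')[0]
    return 'http://{}:{}/requestversion/{}'.format(host, port, version)


# Hypothetical address, for illustration only. In Colab the real value
# would be read from os.environ['COLAB_TPU_ADDR'].
example_addr = '10.0.0.2:8470'
url = tpu_version_url(example_addr, '2.2.0-dev20200311')
print(url)
```

In a Colab TPU runtime the resulting URL would then be passed to requests.post, as in the snippet above, before installing the matching tf-nightly build.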

The following explanation is from https://github.com/tensorflow/tensorflow/issues/34346:

When you use Dataset.from_generator (or pass a generator to Keras which will call it under the hood), the Dataset embeds the generator in a PyFunc op in its graph, and every time that op is invoked it calls next on the generator and gets the resultant bytes. (Basically treating Python as a black box.)

When everything is running on the same machine this is fine, but the trouble is that the way TPUs work is that there is a separate machine controlling the TPU (called, imaginatively, the TPU host controller. ^^), and you run things on the TPU by sending it a TensorFlow graph to execute. So the graph containing that PyFunc gets sent to the TPU, and the TPU can't execute it because there is no Python on the TPU host machine. (And even if there were, it wouldn't be the same interpreter with the same state as your local machine.) So it fails by telling you it can't execute the PyFunc op, but unfortunately not in a very clear way.
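The point about interpreter state can be illustrated with a loose analogy in plain Python. This is not what TensorFlow does internally; it is only a sketch showing that a live generator is tied to the process that created it and cannot be serialized for another machine, while data already pulled out of it can. The names batch_generator and can_serialize are hypothetical.

```python
import pickle


def batch_generator():
    """Toy stand-in for a data generator: yields small lists forever."""
    n = 0
    while True:
        yield [n, n + 1]
        n += 2


def can_serialize(obj):
    """Return True if obj can be pickled, False if pickling raises TypeError."""
    try:
        pickle.dumps(obj)
        return True
    except TypeError:
        return False


gen = batch_generator()

# The generator holds live interpreter state, so it cannot be shipped as-is.
generator_ok = can_serialize(gen)

# Batches that were already produced are ordinary data and serialize fine.
materialized = [next(gen) for _ in range(3)]
data_ok = can_serialize(materialized)

print(generator_ok, data_ok)
```

The same idea is why the data has to be produced on the machine that owns the generator: either the framework runs the Python function locally and forwards the results, or the data is materialized ahead of time in a form the remote host can read.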
