
I am attempting to port some TensorFlow 1 code to TensorFlow 2. The old code used the now deprecated MultiRNNCell to create a GRU layer with multiple hidden layers. In TensorFlow 2 I want to use the in-built GRU Layer, but there doesn't seem to be an option which allows for multiple hidden layers with that class. The PyTorch equivalent has such an option exposed as an initialization parameter, num_layers.
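For reference, the PyTorch version I have in mind looks roughly like this (just a sketch with placeholder sizes, not my actual code):

import torch

# num_layers stacks several hidden GRU layers inside a single module.
gru = torch.nn.GRU(input_size=1024, hidden_size=1024, num_layers=4,
                   batch_first=True)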

My workaround has been to use the TensorFlow RNN layer and pass it one GRU cell per hidden layer I want - this is the approach recommended in the docs:

import tensorflow as tf

dim = 1024
num_layers = 4

# One GRU cell per hidden layer, all wrapped in a single RNN layer.
cells = [tf.keras.layers.GRUCell(dim) for _ in range(num_layers)]
gru_layer = tf.keras.layers.RNN(
    cells,
    return_sequences=True,
    stateful=True
)

But the in-built GRU layer has support for CuDNN, which the plain RNN layer seems to lack. To quote the docs:

Mathematically, RNN(LSTMCell(10)) produces the same result as LSTM(10). In fact, the implementation of this layer in TF v1.x was just creating the corresponding RNN cell and wrapping it in a RNN layer. However using the built-in GRU and LSTM layers enables the use of CuDNN and you may see better performance.

So how can I achieve this? How do I get a GRU layer that both supports multiple hidden layers and can make use of CuDNN? Given that the in-built GRU layer in TensorFlow lacks such an option, is one in fact necessary? Or is the only way to get a deep GRU network to stack multiple GRU layers in a sequence?
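For what it's worth, my reading of the docs is that the CuDNN kernel is only picked when the GRU keeps its default configuration, roughly the following (the exact conditions are my own reading of the docs, so treat them as assumptions):

import tensorflow as tf

# Defaults written out explicitly; as I understand it, changing any of
# these makes the layer fall back to the generic (non-CuDNN) kernel.
gru = tf.keras.layers.GRU(
    1024,
    activation='tanh',
    recurrent_activation='sigmoid',
    recurrent_dropout=0,
    unroll=False,
    use_bias=True,
    reset_after=True,
    return_sequences=True)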

EDIT: It seems, according to this answer to a similar question, that there is indeed no in-built way to create a GRU Layer with multiple hidden layers, and that they have to be stacked manually.
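Something along these lines is what I mean by stacking manually (a sketch only, untested, with placeholder sizes):

import tensorflow as tf

dim = 1024
num_layers = 4

# Each GRU returns its full output sequence so the next layer in the
# stack receives a sequence as input. Stateful layers additionally need
# a fixed batch size when the model is built.
stacked_gru = tf.keras.Sequential([
    tf.keras.layers.GRU(dim, return_sequences=True, stateful=True)
    for _ in range(num_layers)
])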

What are these hidden layers, conceptually? Are they parallel? Are they serial? What do they do? – Daniel Möller
Thanks. I think this can be done using the Sequential model, wrapping multiple GRU layer instances, so it's serial/sequential, I guess - like the answer linked to in my edit. I'm trying to achieve what is already achievable using the RNN class with a list of cells - a stack of layers, a deeper network - but with CuDNN optimization. The docs state the latter is only available for the in-built GRU layer class. – ChrisM

1 Answer


OK, so it seems the only way to achieve this is to define a stack of GRU layer instances. This is what I came up with (note that I only need stateful GRU layers that return sequences, and I don't need the last layer's return state):

import tensorflow as tf


class RNN(tf.keras.layers.Layer):

    def __init__(self, dim, num_layers=1):
        super(RNN, self).__init__()
        self.dim = dim
        self.num_layers = num_layers

        def layer():
            return tf.keras.layers.GRU(
                self.dim,
                return_sequences=True,
                return_state=True,
                stateful=True)

        # Attach each GRU directly as an attribute so that Keras tracks
        # its weights as part of this parent layer.
        self._layer_names = ['layer_' + str(i) for i in range(self.num_layers)]
        for name in self._layer_names:
            self.__setattr__(name, layer())

    def call(self, inputs):
        seqs = inputs
        state = None
        for name in self._layer_names:
            rnn = self.__getattribute__(name)
            # Each layer consumes the previous layer's output sequence and
            # is seeded with that layer's final state.
            (seqs, state) = rnn(seqs, initial_state=state)
        return seqs

It's necessary to manually add the internal RNN layers to the parent layer using __setattr__. It seems that adding the RNNs to a list and setting that list as a layer attribute won't allow the internal layers to be tracked by the parent layer (see this answer to this issue).
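A quick sanity check that the sub-layers really are tracked (a sketch; the shapes and batch size are placeholders):

import tensorflow as tf

# Stateful GRU layers need a fixed batch size, so call the layer with a
# concrete batch dimension.
batch_size, timesteps, features = 64, 50, 1024
rnn = RNN(dim=1024, num_layers=4)
out = rnn(tf.zeros([batch_size, timesteps, features]))

print(out.shape)         # (64, 50, 1024)
print(len(rnn.weights))  # 3 weight tensors per GRU layer -> 12 in total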

I hoped that this would speed up my network. Tests on Colab have shown no difference so far; if anything, it's actually slightly slower than using a straight RNN initialized with a list of GRU cells. I thought that increasing the batch size from 10 to 64 might make a difference, but no, they still seem to perform at around the same speed.

UPDATE: In fact there does seem to be a noticeable speed-up, but only if I don't decorate my training step function with tf.function (I have a custom training loop and don't use Model.fit). It's not a huge increase in speed - maybe about 33% faster, with a batch size of 96. A much smaller batch size (between 10 and 20) gives an even bigger speed-up, about 70%.
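For what it's worth, this is roughly how I compared the two (a sketch with a dummy loss and placeholder shapes, not my real training loop):

import time
import tensorflow as tf

model = RNN(dim=1024, num_layers=4)
optimizer = tf.keras.optimizers.Adam()

def train_step(x, y):
    with tf.GradientTape() as tape:
        pred = model(x)
        loss = tf.reduce_mean(tf.square(pred - y))
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss

compiled_step = tf.function(train_step)

x = tf.random.normal([96, 50, 1024])
y = tf.random.normal([96, 50, 1024])

for name, step_fn in [('eager', train_step), ('tf.function', compiled_step)]:
    step_fn(x, y)  # warm-up (and tracing for the tf.function version)
    start = time.time()
    for _ in range(10):
        step_fn(x, y)
    print(name, time.time() - start)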