6
votes

I want to better understand the shapes of TensorFlow's BasicLSTMCell kernel and bias.

@tf_export("nn.rnn_cell.BasicLSTMCell")
class BasicLSTMCell(LayerRNNCell):

input_depth = inputs_shape[1].value
h_depth = self._num_units
self._kernel = self.add_variable(
    _WEIGHTS_VARIABLE_NAME,
    shape=[input_depth + h_depth, 4 * self._num_units])
self._bias = self.add_variable(
    _BIAS_VARIABLE_NAME,
    shape=[4 * self._num_units],
    initializer=init_ops.zeros_initializer(dtype=self.dtype))

Why does the kernel have shape=[input_depth + h_depth, 4 * self._num_units] and the bias shape=[4 * self._num_units]? Maybe the factor of 4 comes from the forget gate, block input, input gate, and output gate? And what's the reason for summing input_depth and h_depth?

More information about my LSTM Network:

num_input = 12, timesteps = 820, num_hidden = 64, num_classes = 2.

With tf.trainable_variables() I get the following information:

  • Variable name: Variable:0 Shape: (64, 2) Parameters: 128
  • Variable name: Variable_1:0 Shape: (2,) Parameters: 2
  • Variable name: rnn/basic_lstm_cell/kernel:0 Shape: (76, 256) Parameters: 19456
  • Variable name: rnn/basic_lstm_cell/bias:0 Shape: (256,) Parameters: 256
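
The reported shapes line up with these hyperparameters. As a quick sanity check (plain Python, using only the values stated above):

```python
# Values from the network above
input_depth = 12   # num_input
h_depth = 64       # num_hidden

# Expected kernel and bias shapes per the BasicLSTMCell source
kernel_shape = (input_depth + h_depth, 4 * h_depth)  # (76, 256)
bias_shape = (4 * h_depth,)                          # (256,)

# Parameter count of the kernel
kernel_params = kernel_shape[0] * kernel_shape[1]    # 19456

print(kernel_shape, bias_shape, kernel_params)
```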

The following Code defines my LSTM Network.

def RNN(x, weights, biases):

    # Unstack along the time axis into a list of `timesteps` tensors,
    # each of shape (batch_size, num_input)
    x = tf.unstack(x, timesteps, 1)
    lstm_cell = rnn.BasicLSTMCell(num_hidden)
    outputs, states = rnn.static_rnn(lstm_cell, x, dtype=tf.float32)

    # Linear projection of the last time step's output to num_classes logits
    return tf.matmul(outputs[-1], weights['out']) + biases['out']

1 Answer

6
votes

First, about summing input_depth and h_depth: RNNs generally follow equations like h_t = W*h_{t-1} + V*x_t to compute the state h at time t. That is, we apply one matrix multiplication to the previous state and another to the current input, and add the two results. This is equivalent to concatenating h_{t-1} and x_t (let's call this c), stacking the two matrices W and V (let's call this S), and computing S*c.
Now we have only one matrix multiplication instead of two; this can presumably be parallelized more effectively, so it is done for performance reasons. Since h_{t-1} has size h_depth and x_t has size input_depth, we need to add the two dimensionalities for the concatenated vector c.
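
A small NumPy sketch of this equivalence (the shapes here are made up for illustration; in the cell, W and V are stacked along the row axis, exactly matching the kernel's [input_depth + h_depth, ...] first dimension):

```python
import numpy as np

rng = np.random.default_rng(0)
h_depth, input_depth, out_dim = 3, 2, 4

W = rng.normal(size=(h_depth, out_dim))       # weights applied to h_{t-1}
V = rng.normal(size=(input_depth, out_dim))   # weights applied to x_t
h_prev = rng.normal(size=(1, h_depth))
x_t = rng.normal(size=(1, input_depth))

# Two separate matrix multiplications, then a sum
two_matmuls = h_prev @ W + x_t @ V

# Stack W and V row-wise and concatenate h_{t-1} with x_t
S = np.vstack([W, V])                          # (h_depth + input_depth, out_dim)
c = np.concatenate([h_prev, x_t], axis=1)      # (1, h_depth + input_depth)
one_matmul = c @ S

print(np.allclose(two_matmuls, one_matmul))    # True
```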

Second, you are right about the factor 4 coming from the gates. This is essentially the same as above: Instead of carrying out four separate matrix multiplications for the input and each of the gates, we carry out one multiplication that results in a big vector that is the input and all four gate values concatenated. Then we can just split this vector into four parts. In the LSTM cell source code this happens in lines 627-633.
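
The splitting step can be sketched like this in NumPy (the gate order i, j, f, o follows the BasicLSTMCell source, where j is the block input; the dummy zero tensor just stands in for the matmul result):

```python
import numpy as np

num_units = 64
batch_size = 1

# Stand-in for the result of the single big matmul + bias:
# shape (batch_size, 4 * num_units)
gate_inputs = np.zeros((batch_size, 4 * num_units))

# Split into the four gate pre-activations along the last axis,
# analogous to the array_ops.split call in the cell source
i, j, f, o = np.split(gate_inputs, 4, axis=1)

print(i.shape, j.shape, f.shape, o.shape)  # each (1, 64)
```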