
I'm trying to solve a regression problem with a stacked RNN in TensorFlow. The RNN output should be fed into a fully connected layer for the final prediction. Currently I'm struggling with how to feed the RNN output into the final fully_connected layer. My input is of shape [batch_size, max_sequence_length, num_features].

The RNN Layers are created like this:

# build a stack of LSTM cells (TF 1.x contrib API)
cells = []
for i in range(num_rnn_layers):
    cell = tf.contrib.rnn.LSTMCell(num_rnn_units)
    cells.append(cell)

multi_rnn_cell = tf.contrib.rnn.MultiRNNCell(cells)

# outputs: [batch_size, max_sequence_length, num_rnn_units]
# states:  one LSTMStateTuple (c, h) per layer
outputs, states = tf.nn.dynamic_rnn(cell=multi_rnn_cell,
                                    inputs=Bx_rnn,
                                    dtype=tf.float32)

outputs has shape [batch_size, max_sequence_length, num_rnn_units]. I tried using only the output of the last time step, like this:

final_outputs = tf.contrib.layers.fully_connected(
    outputs[:, -1, :],   # last time step only: [batch_size, num_rnn_units]
    n_targets,
    activation_fn=None)
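For what it's worth, the shape bookkeeping behind this slicing can be checked with a small NumPy sketch (all sizes below are made up for illustration): taking the last time step of a [batch, time, units] tensor yields a [batch, units] matrix, which is exactly what a fully connected layer (a matmul plus bias) consumes.

```python
import numpy as np

batch, T, units, n_targets = 4, 6, 8, 3       # hypothetical sizes

outputs = np.random.randn(batch, T, units)    # stands in for the RNN output
last = outputs[:, -1, :]                      # last time step: (batch, units)

# a fully connected layer without activation is just a matmul plus bias
W = np.random.randn(units, n_targets)
b = np.zeros(n_targets)
final = last @ W + b                          # (batch, n_targets)

print(last.shape, final.shape)                # → (4, 8) (4, 3)
```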

I also found examples and books recommending reshaping the output and target like this:

rnn_outputs = tf.reshape(outputs, [-1, num_rnn_units])  # [batch_size * max_sequence_length, num_rnn_units]
y_reshaped = tf.reshape(y, [-1])

Since I'm currently using a batch size of 500 and a sequence length of 10,000, this leads to huge matrices, very long training times, and huge memory consumption.
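As a sanity check, a NumPy sketch (with made-up small sizes) shows that the reshape trick and a matmul broadcast over the last axis compute the same thing; either way the intermediate matrix has batch_size * max_sequence_length rows, which with 500 × 10,000 is 5,000,000 rows — it is that row count, not the reshape itself, that drives the memory use.

```python
import numpy as np

batch, T, units, n_targets = 2, 5, 4, 3   # hypothetical small sizes

outputs = np.random.randn(batch, T, units)
W = np.random.randn(units, n_targets)
b = np.random.randn(n_targets)

# reshape approach: fold time into the batch dimension
flat = outputs.reshape(-1, units)         # (batch * T, units)
y_flat = flat @ W + b                     # (batch * T, n_targets)
y_a = y_flat.reshape(batch, T, n_targets)

# equivalent: matmul broadcasts over the leading axes
y_b = outputs @ W + b                     # (batch, T, n_targets)

assert np.allclose(y_a, y_b)
```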

I've also found many articles recommending unstacking the inputs and stacking the outputs again, but I couldn't get that to work due to shape mismatches.

What would be the correct way to feed the RNN output into a fully_connected layer? Or should I use the RNN states instead of the outputs?

Edit, for clarification: I do need these long sequences because I'm trying to model a physical system. The input is a single feature consisting of white noise. I have multiple outputs (45 in this specific system). Impulses affect the system state for roughly 10,000 time steps.

I.e., currently I'm trying to model a car's gear bridging that was excited by a shaker. Outputs were measured by 15 acceleration sensors in 3 directions (X, Y & Z).

The batch size of 500 was picked arbitrarily.

Regardless of probable vanishing gradients or potential memory issues with long sequences, I'd be interested in how to feed the data correctly. We do have appropriate hardware (i.e. an Nvidia Titan V). Furthermore, we were already able to model the system behaviour with classic DNNs using lags of >3,000 time steps with good accuracy.

Can you explain why you need 10,000 time steps and a batch size of 500? Is the content of your sequences homogeneous or structured? What type of data / dimension of input and number of classes? – pixelou

1 Answer


I believe 10,000 time steps is very long by any standard. This causes, or will cause, several issues:

  • memory issues, as you observed: backpropagating the gradient requires storing all states along the time axis
  • training performance: even for gated units, the gradient will probably not reach the first time steps anyway
  • prediction performance: assuming the network is properly trained, it is unlikely that the first observations have any impact on the final state value, and consequently on the prediction, so processing 10,000 time steps is a waste of time.
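The memory point can be made concrete with back-of-envelope arithmetic. The batch size and sequence length come from the question; num_rnn_units = 128 is a made-up example value. Storing float32 activations for every time step grows linearly with sequence length, per layer:

```python
# Back-of-envelope: activation memory for backprop through time.
batch_size = 500        # from the question
seq_len = 10_000        # from the question
num_rnn_units = 128     # hypothetical example value
bytes_per_float32 = 4

per_layer = batch_size * seq_len * num_rnn_units * bytes_per_float32
print(per_layer / 1e9, "GB per layer (outputs alone)")  # → 2.56 GB per layer
```

And that counts only the hidden-state outputs of one layer, not cell states, gate activations, or gradients.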

Among other solutions, you can:

  • Process the sequence in (possibly overlapping) chunks of smaller size, train the model to give predictions over each of them, and then aggregate the predictions — or do some intermediate fusion, which will be trickier to implement, especially with variable-length sequences.
  • Aggregate and/or subsample the input, or use any other trick to reduce the apparent duration; maybe use a temporal convolution before subsampling if you are afraid of losing fine temporal patterns.
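The first suggestion (overlapping chunks) can be sketched in NumPy; the window and hop sizes below are hypothetical, and `chunk` is an illustrative helper, not a library function:

```python
import numpy as np

def chunk(seq, window, hop):
    """Split a (T, features) sequence into overlapping (window, features) chunks."""
    starts = range(0, len(seq) - window + 1, hop)
    return np.stack([seq[s:s + window] for s in starts])

seq = np.random.randn(10_000, 1)          # one white-noise feature, as in the question
chunks = chunk(seq, window=500, hop=250)  # 50% overlap; sizes are made up
print(chunks.shape)                       # → (39, 500, 1)
```

Each chunk then gets its own prediction, and the per-chunk predictions are aggregated (e.g. averaged) afterwards.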