
I am building an LSTM model using word2vec as input, in the TensorFlow framework. I have finished the word embedding part, but I am stuck on the LSTM part.

The issue is that my sentences have different lengths, which means I have to either pad them or use dynamic_rnn with the sequence lengths specified. I am struggling with both.

  1. Padding. The confusing part is where the padding happens. My model looks like this:

    word_matrix = model.wv.syn0                    # gensim word2vec weights, shape (vocab_size, embedding_dim)
    X = tf.placeholder(tf.int32, [None, None])     # batches of word indices, shape (batch_size, max_seq_len)
    data = tf.nn.embedding_lookup(word_matrix, X)  # shape (batch_size, max_seq_len, embedding_dim)

Then, I feed sequences of word indices (rows of word_matrix) into X. I am worried that if I pad the sequences fed into X with zeros, I will keep feeding unnecessary input (word_matrix[0] in this case).

So, I am wondering what the correct way of zero-padding is. It would be great if you could let me know how to implement it with TensorFlow.
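For reference, this is roughly how I pad on the NumPy side before feeding X (a minimal sketch; pad_batch is just my own helper, and I assume index 0 is reserved as the pad token):

    import numpy as np

    def pad_batch(sequences, max_len=None):
        """Right-pad variable-length lists of word indices with zeros (truncating if too long)."""
        max_len = max_len or max(len(s) for s in sequences)
        batch = np.zeros((len(sequences), max_len), dtype=np.int32)   # 0 = pad index
        lengths = np.zeros(len(sequences), dtype=np.int32)
        for i, seq in enumerate(sequences):
            trunc = seq[:max_len]
            batch[i, :len(trunc)] = trunc
            lengths[i] = len(trunc)
        return batch, lengths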

  2. dynamic_rnn. For this, I declared a list containing the lengths of all the sentences and feed those in along with X and y at the end. In this case, I cannot feed the inputs in batches, though. Then I ran into this error (ValueError: as_list() is not defined on an unknown TensorShape.), which makes me think the sequence_length argument only accepts a list? (My thoughts may be entirely incorrect, though.)

The following is my code for this.

X = tf.placeholder(tf.int32)                     # word indices; no shape specified
labels = tf.placeholder(tf.int32, [None, numClasses])
length = tf.placeholder(tf.int32)                # per-sentence lengths; no shape specified

data = tf.nn.embedding_lookup(word_matrix, X)    # (batch, time, numDimensions)

lstmCell = tf.contrib.rnn.BasicLSTMCell(lstmUnits, state_is_tuple=True)
lstmCell = tf.contrib.rnn.DropoutWrapper(cell=lstmCell, output_keep_prob=0.25)
initial_state = lstmCell.zero_state(batchSize, tf.float32)
value, _ = tf.nn.dynamic_rnn(lstmCell, data, sequence_length=length,
                             initial_state=initial_state, dtype=tf.float32)

I am really struggling with this part, so any help would be much appreciated.

Thank you in advance.


1 Answer


TensorFlow does not support variable-length Tensors, so when you declare a Tensor, the list/NumPy array should have a uniform shape.

  1. From your first part, what I understand is that you are already able to pad zeros into the last time steps of each sequence, which is exactly what you should do. Here is how it would look for a batch size of 4, a max sequence length of 10, and 50 hidden units:

    [4, 10, 50] would be the size of your whole batch, but internally it may be shaped like this when you try to visualize the padding:

        [[5+5pad, 50], [10, 50], [8+2pad, 50], [9+1pad, 50]]
    

    Each pad represents one time step: a Tensor of hidden size 50 filled with nothing but zeros. Look at this question and this one to learn more about how to pad manually. A small NumPy sketch of this layout follows below.
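    Here is a minimal NumPy sketch of that padded layout (the lengths [5, 10, 8, 9] and the random values are just for illustration):

        import numpy as np

        batch_size, max_len, hidden = 4, 10, 50
        lengths = [5, 10, 8, 9]                  # true lengths from the example above

        batch = np.random.randn(batch_size, max_len, hidden).astype(np.float32)
        for i, n in enumerate(lengths):
            batch[i, n:, :] = 0.0                # zero out the padded time steps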

  2. You will use dynamic_rnn for exactly this reason: you do not want to run the RNN over the padded steps. The tf.nn.dynamic_rnn API ensures that when you pass the sequence_length argument.

    For the above example, that argument will be [5, 10, 8, 9]. You can compute it by counting the non-zero time steps for each batch component. A simple way to compute that would be:

        # data has shape (batch, max_len, numDimensions); a padded step is all zeros
        data_mask = tf.sign(tf.reduce_max(tf.abs(data), axis=2))        # 1.0 for real steps, 0.0 for pads
        data_len = tf.cast(tf.reduce_sum(data_mask, axis=1), tf.int32)  # (batch,) true lengths
    

    and pass it to the tf.nn.dynamic_rnn API:

        value, final_state = tf.nn.dynamic_rnn(lstmCell, data, sequence_length=data_len,
                                               initial_state=initial_state)
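    One more detail: past sequence_length, dynamic_rnn copies the state through and emits all-zero outputs, so if you want one vector per sentence you can gather the output at each sequence's last real step. A minimal sketch, reusing value and data_len from above:

        batch_range = tf.range(tf.shape(value)[0])                     # [0, 1, ..., batch-1]
        last_step_idx = tf.stack([batch_range, data_len - 1], axis=1)  # (batch, 2) [row, time] pairs
        last_output = tf.gather_nd(value, last_step_idx)               # (batch, lstmUnits)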