For the second question: hidden states at the padded timesteps will not be computed.
To see how that happens, let's first look at what pack_padded_sequence does for us:
import torch
from torch.nn.utils.rnn import pad_sequence, pad_packed_sequence, pack_padded_sequence

raw = [torch.ones(25, 300) / 2,
       torch.ones(22, 300) / 2.3,
       torch.ones(15, 300) / 3.2]
padded = pad_sequence(raw)  # size: [25, 3, 300]
lengths = torch.as_tensor([25, 22, 15], dtype=torch.int64)
packed = pack_padded_sequence(padded, lengths)
So far we have created three tensors with different lengths (i.e., different numbers of timesteps in the RNN context), padded them to a common length, and then packed the padded tensor (note that pack_padded_sequence expects the lengths in descending order unless you pass enforce_sorted=False). Now if we run
print(padded.size())
print(packed.data.size()) # packed.data refers to the "packed" tensor
we will see:
torch.Size([25, 3, 300])
torch.Size([62, 300])
Obviously 62 does not come from 25 * 3; it is 25 + 22 + 15, the sum of the lengths. So what pack_padded_sequence does is keep only the meaningful timesteps of each batch entry, according to the lengths tensor we passed to it (i.e. if we passed [25, 25, 25], the size of packed.data would be [75, 300] even though the raw tensors are unchanged). In short, with pack_padded_sequence the RNN never even sees the padded timesteps.
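We can verify this directly: a PackedSequence also stores a batch_sizes tensor recording how many batch entries are still active at each timestep. A quick check (my own illustration, reusing the packed and lengths defined above):

print(lengths.sum())             # tensor(62), matching packed.data.size(0)
print(packed.batch_sizes)        # 25 values: fifteen 3s, seven 2s, three 1s
print(packed.batch_sizes.sum())  # tensor(62), i.e. 45 + 14 + 3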
Now let's see the difference when we pass padded and packed to the RNN:
rnn = torch.nn.RNN(input_size=300, hidden_size=2)
padded_outp, padded_hn = rnn(padded) # size: [25, 3, 2] / [1, 3, 2]
packed_outp, packed_hn = rnn(packed) # 'PackedSequence' Obj / [1, 3, 2]
undo_packed_outp, _ = pad_packed_sequence(packed_outp)
# compare the h_n returned by the two calls
print(padded_hn) # tensor([[[-0.2329, -0.6179], [-0.1158, -0.5430], [ 0.0998, -0.3768]]])
print(packed_hn) # tensor([[[-0.2329, -0.6179], [ 0.5622, 0.1288], [ 0.5683, 0.1327]]])
# the output at the last timestep (the 25th)
print(padded_outp[-1]) # tensor([[-0.2329, -0.6179], [-0.1158, -0.5430], [ 0.0998, -0.3768]])
print(undo_packed_outp.data[-1]) # tensor([[-0.2329, -0.6179], [ 0.0000, 0.0000], [ 0.0000, 0.0000]])
The values of padded_hn and packed_hn differ because the RNN does compute over the padding for padded but not for packed (the PackedSequence object): padded_hn is the state after all 25 timesteps, padding included, while packed_hn holds the state at each entry's actual last timestep (the 25th, 22nd, and 15th respectively). The same can be observed from the output at the last timestep: with padded, all three batch entries get a non-zero output at timestep 25 even though two of them are shorter than 25; with packed, the output at timestep 25 is simply not computed for the shorter entries, so pad_packed_sequence fills it with 0.
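As a side note, if you need each entry's true last output from the padded output tensor, you can gather it with the lengths yourself. A minimal sketch (the gather-based indexing is my own addition, reusing packed_outp, packed_hn, and lengths from above):

outp, _ = pad_packed_sequence(packed_outp)           # size: [25, 3, 2]
idx = (lengths - 1).view(1, -1, 1).expand(1, 3, 2)   # each entry's last valid timestep
last = outp.gather(0, idx)                           # size: [1, 3, 2]
print(torch.allclose(last, packed_hn))               # True for this single-layer RNN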
P.S. Another observation:
print([(undo_packed_outp[:, i, :].sum(-1) != 0).sum() for i in range(3)])
would give us [tensor(25), tensor(22), tensor(15)], which aligns with the actual lengths of our inputs.
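Counting non-zero outputs is a roundabout way to recover the lengths, though: pad_packed_sequence already returns them as its second value (the one we discarded with _ above):

_, out_lengths = pad_packed_sequence(packed_outp)
print(out_lengths)  # tensor([25, 22, 15])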