3 votes

Returns:

last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size)): Sequence of hidden-states at the output of the last layer of the model.

pooler_output (torch.FloatTensor of shape (batch_size, hidden_size)): Last layer hidden-state of the first token of the sequence (classification token) further processed by a Linear layer and a Tanh activation function. The Linear layer weights are trained from the next sentence prediction (classification) objective during pre-training.

This output is usually not a good summary of the semantic content of the input; you're often better off averaging or pooling the sequence of hidden-states for the whole input sequence.

hidden_states (tuple(torch.FloatTensor), optional, returned when config.output_hidden_states=True): Tuple of torch.FloatTensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

Hidden-states of the model at the output of each layer plus the initial embedding outputs.

attentions (tuple(torch.FloatTensor), optional, returned when config.output_attentions=True): Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.

This is from https://huggingface.co/transformers/model_doc/bert.html#bertmodel. Although the description in the documentation is clear, I still don't understand the hidden_states return value. It is a tuple with one element for the output of the embeddings plus one element for the output of each layer. Please tell me how to distinguish these elements, or what each of them means. Thanks very much!
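
For context, this is roughly the code I am running (a minimal sketch, assuming bert-base-uncased and a transformers version where the model output exposes .hidden_states; on older versions the same tuple is returned as outputs[2]):

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_hidden_states=True).eval()

inputs = tokenizer("Hello, BERT!", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

print(outputs.last_hidden_state.shape)   # (batch_size, sequence_length, hidden_size)
print(outputs.pooler_output.shape)       # (batch_size, hidden_size)
print(len(outputs.hidden_states))        # a tuple of tensors -- but which element is which?

# averaging the last layer over tokens, as the docs suggest for a sentence representation
sentence_vector = outputs.last_hidden_state.mean(dim=1)
```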

2
You might find this Jupyter Notebook tutorial useful: github.com/BramVanroy/bert-for-inference/blob/master/… – Blithering

2 Answers

4 votes

hidden_states (tuple(torch.FloatTensor), optional, returned when config.output_hidden_states=True): Tuple of torch.FloatTensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

Hidden-states of the model at the output of each layer plus the initial embedding outputs.

For a given token, its input representation is constructed by summing the corresponding token embedding, segment embedding, and position embedding. This input representation is called the initial embedding output, which can be found at index 0 of the tuple hidden_states. This figure explains how the embeddings are calculated:

[figure: input representation formed by summing token embeddings, segment embeddings, and position embeddings]
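
A rough sketch of that sum in code (assuming bert-base-uncased; note that BertEmbeddings also applies LayerNorm, and dropout during training, on top of this sum, so hidden_states[0] is the normalized sum rather than the raw sum):

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_hidden_states=True).eval()

inputs = tokenizer("hello world", return_tensors="pt")
input_ids = inputs["input_ids"]
token_type_ids = inputs["token_type_ids"]                    # segment ids, all 0 for a single sentence
position_ids = torch.arange(input_ids.size(1)).unsqueeze(0)  # 0, 1, 2, ...

emb = model.embeddings
summed = (emb.word_embeddings(input_ids)
          + emb.token_type_embeddings(token_type_ids)
          + emb.position_embeddings(position_ids))

with torch.no_grad():
    hidden_states = model(**inputs).hidden_states            # outputs[2] on older, tuple-style versions

# index 0 of hidden_states is the embedding output: LayerNorm over the summed embeddings
print(torch.allclose(hidden_states[0], emb.LayerNorm(summed), atol=1e-5))  # expected: True in eval mode
```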

The remaining 12 elements in the tuple contain the output of the corresponding hidden layer. E.g., the last hidden layer can be found at index 12, which is the 13th item in the tuple. Both the initial embedding output and the hidden states have shape [batch_size, sequence_length, hidden_size]. It is useful to compare the indexing of hidden_states, bottom-up, with this image from the BERT paper.

[figure: BERT's stack of encoder layers from the BERT paper, read bottom-up]
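
As a quick check of this indexing (a sketch, again assuming bert-base-uncased):

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_hidden_states=True).eval()

with torch.no_grad():
    outputs = model(**tokenizer("The quick brown fox", return_tensors="pt"))

hidden_states = outputs.hidden_states   # outputs[2] if your version returns plain tuples
print(len(hidden_states))               # 13: the embedding output + 12 layer outputs
print(hidden_states[0].shape)           # initial embedding output, [batch_size, sequence_length, hidden_size]
print(hidden_states[12].shape)          # output of the 12th (last) layer, same shape
print(torch.allclose(hidden_states[-1], outputs.last_hidden_state))  # True: the last element is last_hidden_state
```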

1 vote

I found the answer in the length of this tuple: it is (1 + num_layers), because the tuple contains the layer outputs plus the initial embedding output. So the output of the last layer (the last element) is different from the embedding output (the first element). :D
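
That length is easy to verify (a sketch, assuming bert-base-uncased):

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_hidden_states=True).eval()

with torch.no_grad():
    hidden_states = model(**tokenizer("hello", return_tensors="pt")).hidden_states

print(len(hidden_states), 1 + model.config.num_hidden_layers)  # 13 13
```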