Returns:
last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size)): Sequence of hidden-states at the output of the last layer of the model.
pooler_output (torch.FloatTensor of shape (batch_size, hidden_size)): Last layer hidden-state of the first token of the sequence (classification token), further processed by a Linear layer and a Tanh activation function. The Linear layer weights are trained from the next sentence prediction (classification) objective during pre-training.
This output is usually not a good summary of the semantic content of the input; you are often better off averaging or pooling the sequence of hidden-states over the whole input sequence.
hidden_states (tuple(torch.FloatTensor), optional, returned when config.output_hidden_states=True): Tuple of torch.FloatTensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).
Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (tuple(torch.FloatTensor), optional, returned when config.output_attentions=True): Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).
Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
This is from https://huggingface.co/transformers/model_doc/bert.html#bertmodel. Although the description in the document is clear, I still don't understand the hidden_states in the returns. It is a tuple containing one tensor for the output of the embeddings and one tensor for the output of each layer. Could you please tell me how to distinguish them, or what each of them means? Thanks very much!
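For context, here is roughly how I am inspecting these outputs. This is just a minimal sketch, assuming bert-base-uncased and a recent transformers version where the model returns a ModelOutput object; the exact flags and attribute access may differ in older versions:

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
# Ask the model to also return the per-layer hidden states.
model = BertModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
model.eval()

inputs = tokenizer("Hello, world!", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

hidden_states = outputs.hidden_states

print(len(hidden_states))      # 13 for bert-base: 1 embedding output + 12 layer outputs
print(hidden_states[0].shape)  # embedding output: (batch_size, sequence_length, 768)
print(hidden_states[-1].shape) # output of the last encoder layer, same shape
# The last entry should match last_hidden_state.
print(torch.allclose(hidden_states[-1], outputs.last_hidden_state))
```

If I understand correctly, hidden_states[0] would be the output of the embedding layer and hidden_states[1:] the outputs of the 12 encoder layers in order, with hidden_states[-1] matching last_hidden_state, but please correct me if I've misread the docs.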