The BERT model for language modeling and sequence classification includes an extra projection layer between the last transformer layer and the classification layer (it consists of a linear layer of size hidden_dim x hidden_dim, a dropout layer and a tanh activation). This was not described in the original paper, but was clarified here. This intermediate layer is pre-trained together with the rest of the transformer layers.
In huggingface's BertModel, this layer is called pooler.
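For reference, a minimal sketch (assuming the transformers library is installed and using the bert-base-uncased checkpoint) that prints this pooler module:

```python
from transformers import BertModel

# Load a standard BERT checkpoint and inspect its pooler module:
# a Linear(hidden_dim, hidden_dim) followed by a Tanh activation,
# applied to the [CLS] token representation.
bert = BertModel.from_pretrained("bert-base-uncased")
print(bert.pooler)
# BertPooler(
#   (dense): Linear(in_features=768, out_features=768, bias=True)
#   (activation): Tanh()
# )
```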
According to its paper, the FlauBERT model (an XLMModel trained on a French corpus) also includes this pooler layer: "The classification head is composed of the following layers, in order: dropout, linear, tanh activation, dropout, and linear." However, when loading a FlauBERT model with huggingface (e.g., with FlaubertModel.from_pretrained(...) or FlaubertForSequenceClassification.from_pretrained(...)), the resulting model seems to include no such layer.
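For instance, listing the model's sub-modules (a minimal sketch, assuming the flaubert/flaubert_base_cased checkpoint published by the FlauBERT authors) shows nothing named pooler:

```python
from transformers import FlaubertModel

# Load a FlauBERT checkpoint and list its top-level sub-modules;
# unlike BertModel, there is no 'pooler' entry.
flaubert = FlaubertModel.from_pretrained("flaubert/flaubert_base_cased")
print([name for name, _ in flaubert.named_children()])
# e.g. ['position_embeddings', 'embeddings', 'layer_norm_emb',
#       'attentions', 'layer_norm1', 'ffns', 'layer_norm2']
```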
Hence the question: why is there no pooler layer in huggingface's FlauBERT model?