The BERT model for language modeling and sequence classification includes an extra projection layer between the last transformer layer and the classification layer (it consists of a linear layer of size hidden_dim x hidden_dim, a dropout layer and a tanh activation). This was not described in the original paper, but was clarified here. This intermediate layer is pre-trained together with the rest of the transformer layers.
In huggingface's BertModel, this layer is called pooler.
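For reference, a minimal sketch (assuming the transformers library is installed and using the bert-base-uncased checkpoint) that prints this pooler module:

```python
from transformers import BertModel

# Load a standard BERT checkpoint and inspect its pooler module:
# a Linear(hidden_dim, hidden_dim) followed by a Tanh activation,
# applied to the [CLS] token representation.
bert = BertModel.from_pretrained("bert-base-uncased")
print(bert.pooler)
# BertPooler(
#   (dense): Linear(in_features=768, out_features=768, bias=True)
#   (activation): Tanh()
# )
```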
According to its paper, the FlauBERT model (an XLMModel trained on a French corpus) also includes this pooler layer: "The classification head is composed of the following layers, in order: dropout, linear, tanh activation, dropout, and linear." However, when loading a FlauBERT model with huggingface (e.g., with FlaubertModel.from_pretrained(...) or FlaubertForSequenceClassification.from_pretrained(...)), the resulting model seems to include no such layer.
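For instance, listing the model's sub-modules (a minimal sketch, assuming the flaubert/flaubert_base_cased checkpoint published by the FlauBERT authors) shows nothing named pooler:

```python
from transformers import FlaubertModel

# Load a FlauBERT checkpoint and list its top-level sub-modules;
# unlike BertModel, there is no 'pooler' entry.
flaubert = FlaubertModel.from_pretrained("flaubert/flaubert_base_cased")
print([name for name, _ in flaubert.named_children()])
# e.g. ['position_embeddings', 'embeddings', 'layer_norm_emb',
#       'attentions', 'layer_norm1', 'ffns', 'layer_norm2']
```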
Hence the question: why is there no pooler layer in huggingface's FlauBERT model?