The Huggingface BERT implementation has a hack that removes the pooler parameters from the optimizer:
```python
# hack to remove pooler, which is not used
# thus it produce None grad that break apex
param_optimizer = [n for n in param_optimizer if 'pooler' not in n[0]]
```
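For context, this is roughly how that filter sits in the optimizer construction of the pretraining scripts. A minimal sketch, assuming the transformers library, with `BertModel` and `torch.optim.AdamW` standing in for the actual pretraining model and the apex fused optimizer, and with the usual weight-decay grouping (the decay values are common defaults, not taken from the question):

```python
import torch
from transformers import BertModel

# Stand-in for the model being pretrained; BertModel keeps its pooler by default.
model = BertModel.from_pretrained("bert-base-uncased")

param_optimizer = list(model.named_parameters())

# The hack: drop the pooler parameters so the optimizer never sees them.
param_optimizer = [n for n in param_optimizer if 'pooler' not in n[0]]

# Typical BERT weight-decay grouping (assumed values, not from the question).
no_decay = ['bias', 'LayerNorm.weight']
optimizer_grouped_parameters = [
    {'params': [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)],
     'weight_decay': 0.01},
    {'params': [p for n, p in param_optimizer if any(nd in n for nd in no_decay)],
     'weight_decay': 0.0},
]

# AdamW stands in here for apex's fused optimizer used in the real script.
optimizer = torch.optim.AdamW(optimizer_grouped_parameters, lr=1e-4)
```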
We are trying to run pretraining on Huggingface BERT models, and training always diverges later on if this pooler hack is not applied. I also see the pooler layer being used during classification:
```python
pooled_output = outputs[1]
pooled_output = self.dropout(pooled_output)
logits = self.classifier(pooled_output)
```
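In that fine-tuning path the classification loss flows through the pooler, so its parameters do receive gradients. A quick check, assuming transformers' `BertForSequenceClassification` and a hypothetical toy batch:

```python
import torch
from transformers import BertForSequenceClassification

# Toy batch, only to push one backward pass through the classification head.
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
input_ids = torch.randint(0, model.config.vocab_size, (2, 16))
labels = torch.tensor([0, 1])

outputs = model(input_ids=input_ids, labels=labels)
outputs.loss.backward()

# The loss goes classifier(dropout(pooler([CLS]))), so the pooler is in the graph.
for name, param in model.named_parameters():
    if 'pooler' in name:
        print(name, param.grad is not None)  # True for pooler.dense.weight / bias
```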
The pooler itself is a single dense layer with a tanh activation, applied to the hidden state of the first ([CLS]) token:
```python
from torch import nn

class BertPooler(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.dense = nn.Linear(config.hidden_size, config.hidden_size)
        self.activation = nn.Tanh()

    def forward(self, hidden_states):
        # We "pool" the model by simply taking the hidden state corresponding
        # to the first token.
        first_token_tensor = hidden_states[:, 0]
        pooled_output = self.dense(first_token_tensor)
        pooled_output = self.activation(pooled_output)
        return pooled_output
```
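To make the "None grad" from the code comment concrete: a small check, again assuming the transformers library, that builds a toy loss from the sequence output only (as an MLM head does) and then inspects the pooler's gradients:

```python
import torch
from transformers import BertModel

model = BertModel.from_pretrained("bert-base-uncased")  # pooler is created by default
input_ids = torch.randint(0, model.config.vocab_size, (2, 16))

outputs = model(input_ids=input_ids)
sequence_output = outputs[0]  # per-token hidden states; the pooled output would be outputs[1]

# Toy stand-in for an MLM-style loss: it only reads the sequence output,
# so the pooler never enters the autograd graph.
loss = sequence_output.pow(2).mean()
loss.backward()

for name, param in model.named_parameters():
    if 'pooler' in name:
        print(name, param.grad)  # None: nothing ever backpropagated into the pooler
```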
My question is: why does this pooler hack fix the numerical instability?
Problem seen with pooler