0
votes

I am using a model consisting of an embedding layer and an LSTM to perform sequence labelling, in pytorch + torchtext. I have already tokenised the sentences.

If I use self-trained or other pre-trained word embedding vectors, this is straightforward.

But if I use the Huggingface transformers BertTokenizer.from_pretrained and BertModel.from_pretrained there is a '[CLS]' and '[SEP]' token added to the beginning and end of the sentence, respectively. So the output of the model becomes a sequence that is two elements longer than the label/target sequence.

What I am unsure of is:

  1. Are these two tags needed for the BertModel to embed each token of a sentence "correctly"?
  2. If they are needed, can I take them out after the BERT embedding layer, before the input to the LSTM, so that the lengths are correct in the output?
1

1 Answers

1
votes
  1. Yes, BertModel needed them since without those special symbols added, the output representations would be different. However, my experience says, if you fine-tune BertModel on the labeling task without [CLS] and [SEP] token added, then you may not see a significant difference. If you use BertModel to extract fixed word features, then you better add those special symbols.

  2. Yes, you can take out the embedding of those special symbols. In fact, this is a general idea for sequence labeling or tagging tasks.

I suggest taking a look at some sequence labeling or tagging examples using BERT to become confident about your modeling decisions. You can find NER tagging example using Huggingface transformers here.