I am using a model consisting of an embedding layer and an LSTM to perform sequence labelling, in pytorch + torchtext. I have already tokenised the sentences.
If I use self-trained or other pre-trained word embedding vectors, this is straightforward.
But if I use the Huggingface transformers BertTokenizer.from_pretrained
and BertModel.from_pretrained
there is a '[CLS]'
and '[SEP]'
token added to the beginning and end of the sentence, respectively. So the output of the model becomes a sequence that is two elements longer than the label/target sequence.
What I am unsure of is:
- Are these two tags needed for the
BertModel
to embed each token of a sentence "correctly"? - If they are needed, can I take them out after the BERT embedding layer, before the input to the LSTM, so that the lengths are correct in the output?