I understand how the BERT tokenizer works thanks to this article: https://albertauyeung.github.io/2020/06/19/bert-tokenization.html
However, I am confused about how the tokenizer output ends up as the final input tensor of shape (b, 24, 768).
While reading a BERT implementation, I noticed this comment about the embeddings:
```
BERT Embedding which is consisted with under features
    1. TokenEmbedding : normal embedding matrix
    2. PositionalEmbedding : adding positional information using sin, cos
    2. SegmentEmbedding : adding sentence segment info, (sent_A:1, sent_B:2)
    sum of all these features are output of BERTEmbedding
```
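To make sure I'm reading that comment correctly, here is a minimal PyTorch sketch of what I think it describes (the class name, the sizes, and the use of a learned positional embedding instead of sin/cos are my own simplifications, not the repo's actual code):

```python
import torch
import torch.nn as nn

class BERTEmbeddingSketch(nn.Module):
    """Sketch of the summed embedding described in the comment above.
    Names and sizes are illustrative, not the actual repo code."""
    def __init__(self, vocab_size=30522, max_len=512, n_segments=3, d_model=768):
        super().__init__()
        self.token = nn.Embedding(vocab_size, d_model)    # 1. TokenEmbedding
        self.position = nn.Embedding(max_len, d_model)    # 2. PositionalEmbedding (learned here; the comment mentions sin/cos)
        self.segment = nn.Embedding(n_segments, d_model)  # 3. SegmentEmbedding (sent_A vs. sent_B)

    def forward(self, token_ids, segment_ids):
        # token_ids, segment_ids: (batch, seq_len) integer tensors
        positions = torch.arange(token_ids.size(1), device=token_ids.device).unsqueeze(0)
        # Element-wise sum of three (batch, seq_len, d_model) tensors
        return self.token(token_ids) + self.position(positions) + self.segment(segment_ids)

emb = BERTEmbeddingSketch()
token_ids = torch.randint(0, 30522, (2, 24))        # batch of 2, padded to 24 tokens
segment_ids = torch.zeros(2, 24, dtype=torch.long)  # all sentence A
print(emb(token_ids, segment_ids).shape)            # torch.Size([2, 24, 768])
```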
Does this mean that BERT does the following?
- Tokenizes a sentence
- Puts these tokens through a separate embedding layer(?) that maps each token to a high-dimensional vector
- Creates a positional embedding based on each token's position in the sentence.
- And (I also get confused here) creates a segment embedding providing information about the sentence as a whole (what information?)
- Adds all of these together to create a tensor of shape (b, 24, 768), where each of the 24 tokens (including padding) is represented in 768-dimensional space (see the sketch after this list).
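To check the shapes end to end, I also tried the sketch below with the Hugging Face transformers library (the example sentences, the max_length=24 padding, and bert-base-uncased are just placeholder choices on my part):

```python
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

# A hypothetical sentence pair, padded/truncated to 24 tokens to mirror the (b, 24, 768) shape above
enc = tokenizer("The cat sat.", "It purred.",
                padding="max_length", truncation=True, max_length=24,
                return_tensors="pt")

# token_type_ids is the segment information: 0 for sentence A tokens (and padding), 1 for sentence B tokens
print(enc["input_ids"].shape)       # torch.Size([1, 24])
print(enc["token_type_ids"][0])     # 0s for sentence A, 1s for sentence B, 0s for padding

# The embedding layer sums token + position + segment embeddings (then applies LayerNorm and dropout)
embeddings = model.embeddings(input_ids=enc["input_ids"],
                              token_type_ids=enc["token_type_ids"])
print(embeddings.shape)             # torch.Size([1, 24, 768])
```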
Is this correct? And what information does the segment embedding actually provide?