
I understand how the BERT tokenizer works thanks to this article: https://albertauyeung.github.io/2020/06/19/bert-tokenization.html

However, I am confused about how this ends up as the final input shape (b, 24, 768).

When reading the code of BERT I noticed this comment about the embeddings.

BERT Embedding which is consisted with under features
    1. TokenEmbedding : normal embedding matrix
    2. PositionalEmbedding : adding positional information using sin, cos
    2. SegmentEmbedding : adding sentence segment info, (sent_A:1, sent_B:2)
    sum of all these features are output of BERTEmbedding

Does this mean that BERT does the following?

  1. Tokenizes a sentence
  2. Puts these tokens through a separate system(?) that produces high-dimensional embeddings
  3. Creates a positional embedding based on each token's position in the sentence.
  4. And (this is where I also get confused) creates a segment embedding providing information about the sentence as a whole (what information?)
  5. All of this is added together to create a tensor of shape (b, 24, 768), where each of the 24 words/tokens (plus padding) is represented in 768-dimensional space (see the sketch below).
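
To make step 5 concrete, this is roughly how I picture that shape coming about (a minimal sketch using the Hugging Face transformers library; bert-base-uncased and a max length of 24 are just my assumptions):

    from transformers import BertModel, BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    model = BertModel.from_pretrained("bert-base-uncased")

    # Tokenize one sentence and pad it to 24 tokens (so b = 1 here)
    enc = tokenizer("The cat sat on the mat.",
                    padding="max_length", max_length=24, return_tensors="pt")

    # The embedding layer alone already produces the (b, 24, 768) tensor
    x = model.embeddings(input_ids=enc["input_ids"],
                         token_type_ids=enc["token_type_ids"])
    print(x.shape)  # torch.Size([1, 24, 768])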

Is this correct? And what information does the segment embedding actually carry?


1 Answer


Yes, your description is almost correct.

  1. Every sentence is word-piece tokenized first.

  2. During BERT pre-training/fine-tuning, every token learns its own token embedding (in the form of an embedding layer).

  3. Yes, the position embedding is unique for every position. (The code you quote builds it from sine and cosine functions, as in the original Transformer; the released BERT models instead learn their position embeddings, with a maximum sequence length of 512. Either way, every position gets its own position embedding.)

  4. BERT is trained on the Masked Language Model and Next Sentence Prediction (NSP) tasks. For NSP, you pass two sentences, consecutive or not, and learn a classifier that determines whether the 2nd sentence follows the 1st. The 1st sentence is called segment A, and likewise the 2nd is segment B. The embeddings for the segments are also learnt: all tokens from the 1st sentence share the same segment-A embedding, and all tokens from the 2nd sentence share the same segment-B embedding.

  5. The token embeddings, position embeddings, and segment embeddings all have the same dimension, i.e. 768, and they are summed to form the input embeddings (a minimal sketch follows this list).
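
Putting points 2-5 together, here is a minimal sketch of the idea (not BERT's exact implementation: I use learned position embeddings and illustrative sizes here, and real BERT also applies LayerNorm and dropout to the sum):

    import torch
    import torch.nn as nn

    hidden, vocab_size, seq_len, n_segments = 768, 30522, 24, 2

    token_emb    = nn.Embedding(vocab_size, hidden)   # one 768-d vector per word-piece id
    position_emb = nn.Embedding(seq_len, hidden)      # one 768-d vector per position 0..23
    segment_emb  = nn.Embedding(n_segments, hidden)   # one 768-d vector per segment (A=0, B=1)

    b = 4                                             # batch size
    token_ids    = torch.randint(0, vocab_size, (b, seq_len))        # padded token ids
    position_ids = torch.arange(seq_len).unsqueeze(0).expand(b, -1)  # 0..23 for every sample
    segment_ids  = torch.zeros(b, seq_len, dtype=torch.long)         # all segment A here

    # Same shape everywhere, so the three embeddings can simply be summed
    x = token_emb(token_ids) + position_emb(position_ids) + segment_emb(segment_ids)
    print(x.shape)  # torch.Size([4, 24, 768])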

Thus, if a sentence is tokenised (and padded) to length 24 and the batch size is b, the input will have shape (b, 24, 768).
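
To see the segment information concretely, you can pass a sentence pair to the tokenizer (again assuming the Hugging Face bert-base-uncased tokenizer): the token_type_ids it returns are exactly the segment ids (0 for segment A, 1 for segment B) that index the segment embedding table.

    from transformers import BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

    # An NSP-style pair: 1st sentence = segment A, 2nd sentence = segment B
    enc = tokenizer("The man went to the store.", "He bought a gallon of milk.")

    print(tokenizer.convert_ids_to_tokens(enc["input_ids"]))
    # roughly: [CLS] the man went to the store . [SEP] he bought a gallon of milk . [SEP]
    # (rarer words are split into word pieces, e.g. "embeddings" -> em, ##bed, ##ding, ##s)

    print(enc["token_type_ids"])
    # 0 for [CLS] and everything up to the first [SEP] (segment A), 1 for the rest (segment B).
    # All segment-A tokens share one segment embedding, all segment-B tokens share another.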