I'm having trouble migrating my code from pytorch_pretrained_bert to pytorch_transformers. I'm working through a cosine similarity exercise and want to extract text embedding values from the second-to-last of the model's 12 hidden layers.
import torch
from pytorch_pretrained_bert import BertTokenizer, BertModel
# from pytorch_transformers import BertTokenizer, BertModel
import pandas as pd
import numpy as np

model = BertModel.from_pretrained('bert-base-uncased')
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# This is done by default in pytorch_transformers
model.eval()

input_query = "This is my test input query text"
marked_text = "[CLS] " + input_query + " [SEP]"
tokenized_text = tokenizer.tokenize(marked_text)
indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)
segments_ids = [1] * len(tokenized_text)

tokens_tensor = torch.tensor([indexed_tokens])
segments_tensors = torch.tensor([segments_ids])

with torch.no_grad():
    # encoded_layers is a list of 12 tensors, one per hidden layer
    encoded_layers, _ = model(tokens_tensor, segments_tensors)
    # average the second-to-last layer (index 10) over the token dimension
    sentence_embedding = torch.mean(encoded_layers[10], 1)
With pytorch_pretrained_bert the above code works perfectly: the encoded_layers object is a list of 12 hidden-layer tensors, so I can pick the second-to-last layer (index 10) and reduce it by taking an average, which gives me a sentence_embedding object I can run cosine similarities against.
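For context, this is roughly how I use the embedding downstream; a minimal sketch where other_embedding stands in for a second sentence vector produced the same way:

import torch.nn.functional as F

# sentence_embedding has shape [1, 768]; other_embedding is a placeholder
# for another sentence vector built with the same pipeline
similarity = F.cosine_similarity(sentence_embedding, other_embedding, dim=1)
print(similarity.item())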
However, when I migrate my code to the pytorch_transformers library, the resulting encoded_layers object is no longer the full list of 12 hidden layers, but a single torch tensor of shape torch.Size([1, 7, 768]), which results in the following error when I attempt to create the sentence_embedding object:
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-23-7f877a7d2f9c> in <module>
9 encoded_layers, _ = model(tokens_tensor, segments_tensors)
10 test = encoded_layers[0]
---> 11 sentence_embedding = torch.mean(test[10], 1)
12
IndexError: index 10 is out of bounds for dimension 0 with size 7
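Printing shapes makes the failure clear: the index that used to select a layer is now selecting a token position:

# encoded_layers is now a single tensor, not a list of 12 layer tensors
print(encoded_layers.shape)  # torch.Size([1, 7, 768]) -> (batch, tokens, hidden)
test = encoded_layers[0]     # drops the batch dimension -> torch.Size([7, 768])
# test[10] asks for an 11th token, but the sequence only has 7 tokens,
# hence "index 10 is out of bounds for dimension 0 with size 7"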
The migration documentation (https://huggingface.co/transformers/migration.html) says to take the first element of the encoded_layers object as a replacement, but that element is only the last hidden layer and does not give me access to the second-to-last layer of embeddings.
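If I understand the docs correctly, something like the sketch below might expose all the hidden layers again; the output_hidden_states flag is just my reading of the configuration options, so I am not sure it is the intended replacement:

from pytorch_transformers import BertModel

# My guess: request all hidden states at load time via the config option
# output_hidden_states (I may be misreading the docs here)
model = BertModel.from_pretrained('bert-base-uncased', output_hidden_states=True)
model.eval()

with torch.no_grad():
    # The positional argument order changed in pytorch_transformers, so I
    # pass the segment ids by keyword to be safe
    outputs = model(tokens_tensor, token_type_ids=segments_tensors)
    # outputs[0] is the last hidden layer only; if my reading is right,
    # outputs[2] is a tuple of all hidden states (embedding output plus
    # the 12 layers), so outputs[2][-2] would be the second-to-last layer
    hidden_states = outputs[2]
    sentence_embedding = torch.mean(hidden_states[-2], 1)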
Is that the right approach, or how else can I access that layer?
Thank you!