
I'm having trouble migrating my code from pytorch_pretrained_bert to pytorch_transformers. I'm attempting to run a cosine similarity exercise, and I want to extract the text embedding values from the second-to-last of the 12 hidden layers.


import torch
from pytorch_pretrained_bert import BertTokenizer, BertModel
#from pytorch_transformers import BertTokenizer, BertModel
import pandas as pd
import numpy as np

model = BertModel.from_pretrained('bert-base-uncased')
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# This is done by default in the pytorch_transformers
model.eval() 

input_query = "This is my test input query text"
marked_text = "[CLS] " + input_query + " [SEP]"
tokenized_text = tokenizer.tokenize(marked_text)
indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)
segments_ids = [1] * len(tokenized_text)  # one segment ID per token (single-sentence input)
tokens_tensor = torch.tensor([indexed_tokens])
segments_tensors = torch.tensor([segments_ids])
with torch.no_grad():
    encoded_layers, _ = model(tokens_tensor, segments_tensors)
    # average the 11th of the 12 layers (second-to-last) over the token dimension
    sentence_embedding = torch.mean(encoded_layers[10], 1)

Using pytorch_pretrained_bert, the above code works perfectly fine: my encoded_layers object is a list of 12 hidden-layer tensors, so I can pick the 11th layer and reduce it by taking the average, resulting in a sentence_embedding object I can run cosine similarities against.
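For example, a minimal sketch of that comparison (other_embedding is a hypothetical second sentence embedding produced the same way):

import torch.nn.functional as F

# both embeddings have shape [1, 768]
similarity = F.cosine_similarity(sentence_embedding, other_embedding, dim=1)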

However, when I migrate my code to the pytorch_transformers library, the resulting encoded_layers object is no longer the full list of 12 hidden layers, but a single torch tensor object of shape torch.Size([1, 7, 768]), which results in the following error when I attempt to create the sentence_embedding object:

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-23-7f877a7d2f9c> in <module>
      9         encoded_layers, _ = model(tokens_tensor, segments_tensors)
     10         test = encoded_layers[0]
---> 11         sentence_embedding = torch.mean(test[10], 1)
     12 

IndexError: index 10 is out of bounds for dimension 0 with size 7

The migration documentation (https://huggingface.co/transformers/migration.html) states that I should take the first element of the encoded_layers object as a replacement, but that does not give me access to the second-to-last hidden layer of embeddings.
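In other words, as I understand the guide, the forward call now returns a tuple whose first element is only the final layer:

outputs = model(tokens_tensor, segments_tensors)
last_hidden_state = outputs[0]  # shape [1, 7, 768]: the last layer only, not all 12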

How can I access it?

Thank you!

What are you trying to compare the sentence similarity to? Is there any specific benefit of using one of the layers in the middle? – dennlinger

1 Answer


First of all, the newest version is called transformers (not pytorch-transformers).

You need to tell the model that you wish to get all the hidden states:

model = BertModel.from_pretrained('bert-base-uncased', output_hidden_states=True)

Then, you'll find your expected output as the third item in the output tuple:

encoded_layers = model(tokens_tensor, token_type_ids=segments_tensors)[2]

(Note that in transformers the second positional argument is attention_mask, so the segment IDs should be passed by keyword.)

IIRC those hidden states now also include the initial embedding output (so 13 items in total), so you may need to update the index to get the second-to-last layer. It might be safer to use a negative index (-2).
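Putting it together, a minimal sketch of the migrated forward pass (reusing tokens_tensor and segments_tensors from the question; the -2 index assumes the 13-tensor layout described above):

import torch
from transformers import BertModel

# output_hidden_states=True makes the model return all hidden states
model = BertModel.from_pretrained('bert-base-uncased', output_hidden_states=True)
model.eval()

with torch.no_grad():
    # pass segment IDs by keyword: the second positional argument is attention_mask
    outputs = model(tokens_tensor, token_type_ids=segments_tensors)
    hidden_states = outputs[2]          # tuple of 13 tensors: embeddings + 12 layers
    second_to_last = hidden_states[-2]  # shape [1, seq_len, 768]
    sentence_embedding = torch.mean(second_to_last, dim=1)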