
I am using the pretrained SciBERT model to get embeddings for various texts. The code is as follows:

from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained('allenai/scibert_scivocab_uncased', model_max_length=512, truncation=True)
model = AutoModel.from_pretrained('allenai/scibert_scivocab_uncased')

I have passed both the model_max_length and truncation parameters to the tokenizer, but unfortunately they don't truncate the output. If I run a longer text through the tokenizer:

inputs = tokenizer("""long text""")

I get the following error:

Token indices sequence length is longer than the specified maximum sequence length for this model (605 > 512). Running this sequence through the model will result in indexing errors

Now obviously I can't run this through the model due to having too long sequences of tensors. What is the easiest way to truncate the input to fit the maximum sequence length of 512?


1 Answer


truncation is not a parameter of the class constructor (from_pretrained), but a parameter of the __call__ method, so it is ignored when you pass it at load time. Therefore you should use:

tokenizer = AutoTokenizer.from_pretrained('allenai/scibert_scivocab_uncased', model_max_length=512)

len(tokenizer(text, truncation=True).input_ids)

Output:

512
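To illustrate, here is a minimal end-to-end sketch of the fix, assuming the transformers library is installed and the model files can be downloaded. The repeated filler string is only a stand-in for a real long document:

```python
from transformers import AutoTokenizer

# model_max_length only records the limit on the tokenizer;
# it does not truncate anything by itself.
tokenizer = AutoTokenizer.from_pretrained(
    'allenai/scibert_scivocab_uncased', model_max_length=512
)

# A stand-in text long enough to exceed 512 subword tokens.
long_text = "gene expression " * 400

# truncation=True is passed at call time (__call__), so the
# encoding is cut down to model_max_length, including the
# [CLS] and [SEP] special tokens.
inputs = tokenizer(long_text, truncation=True)
print(len(inputs.input_ids))  # 512
```

With the encoding capped at 512 tokens, the result can be fed to the model (e.g. via return_tensors='pt') without the indexing error from the warning above.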