
I am using the pretrained SciBERT model to get embeddings for various texts. The code is as follows:

from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained('allenai/scibert_scivocab_uncased', model_max_length=512, truncation=True)
model = AutoModel.from_pretrained('allenai/scibert_scivocab_uncased')

I have passed both the model_max_length and truncation parameters to the tokenizer, but unfortunately they don't truncate the output. If I run a longer text through the tokenizer:

inputs = tokenizer("""long text""")

I get the following error:

Token indices sequence length is longer than the specified maximum sequence length for this model (605 > 512). Running this sequence through the model will result in indexing errors

Now obviously I can't run this through the model due to having too long sequences of tensors. What is the easiest way to truncate the input to fit the maximum sequence length of 512?


1 Answer


truncation is not a parameter of the class constructor (from_pretrained), but a parameter of the __call__ method, so it is ignored when you pass it at load time. Therefore you should use:

tokenizer = AutoTokenizer.from_pretrained('allenai/scibert_scivocab_uncased', model_max_length=512)

len(tokenizer(text, truncation=True).input_ids)

Output:

512
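To illustrate, here is a minimal end-to-end sketch of the fix, assuming the transformers library is installed and the model files can be downloaded. The repeated filler string is only a stand-in for a real long document:

```python
from transformers import AutoTokenizer

# model_max_length only records the limit on the tokenizer;
# it does not truncate anything by itself.
tokenizer = AutoTokenizer.from_pretrained(
    'allenai/scibert_scivocab_uncased', model_max_length=512
)

# A stand-in text long enough to exceed 512 subword tokens.
long_text = "gene expression " * 400

# truncation=True is passed at call time (__call__), so the
# encoding is cut down to model_max_length, including the
# [CLS] and [SEP] special tokens.
inputs = tokenizer(long_text, truncation=True)
print(len(inputs.input_ids))  # 512
```

With the encoding capped at 512 tokens, the result can be fed to the model (e.g. via return_tensors='pt') without the indexing error from the warning above.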