I am using the pretrained SciBERT model to get embeddings for various texts. The code is as follows:
from transformers import AutoModel, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('allenai/scibert_scivocab_uncased', model_max_length=512, truncation=True)
model = AutoModel.from_pretrained('allenai/scibert_scivocab_uncased')
I have passed both the model_max_length and truncation parameters to the tokenizer, but unfortunately they don't truncate the output. If I run a longer text through the tokenizer:
inputs = tokenizer("""long text""")
I get the following error:
Token indices sequence length is longer than the specified maximum sequence length for this model (605 > 512). Running this sequence through the model will result in indexing errors
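To confirm, the encoding really is longer than the limit. Checking the length of input_ids (the standard key in the returned encoding) is how I verified it:

print(len(inputs["input_ids"]))  # prints 605, above the 512 limit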
Obviously I can't run this through the model, since the sequence exceeds the maximum length. What is the easiest way to truncate the input to fit the model's maximum sequence length of 512?
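For reference, my assumption (based on the documented per-call arguments of the tokenizer, not something I've confirmed is the intended approach) is that truncation has to be requested in the call itself rather than in from_pretrained:

inputs = tokenizer("""long text""", truncation=True, max_length=512)

Is passing these arguments on every call the right way, or is there a setting that applies truncation once at load time?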