1
votes

I am getting the following error :

AssertionError: text input must of type str (single example), List[str] (batch or single pretokenized example) or List[List[str]] (batch of pretokenized examples)., when I run classifier(encoded). My text type is str so I am not sure what I am doing wrong. Any help is very appreciated.

import torch
from transformers import AutoTokenizer, BertTokenizer, BertModel, BertForMaskedLM, AutoModelForSequenceClassification, pipeline

# OPTIONAL: if you want to have more information on what's happening under the hood, activate the logger as follows
import logging
logging.basicConfig(level=logging.INFO)

# Load pre-trained model tokenizer (vocabulary)
# used the cased instead of uncased to account for cases like BAD.
tokenizer = BertTokenizer.from_pretrained('bert-base-cased') 


# alternative? what is the difference between these two tokenizers? 
#tokenizer = AutoTokenizer.from_pretrained("textattack/bert-base-uncased-SST-2")

model = AutoModelForSequenceClassification.from_pretrained("textattack/bert-base-uncased-SST-2")


# feed the model and the tokenizer into the pipeline
classifier = pipeline('sentiment-analysis', model=model, tokenizer= tokenizer)


#---------------sample raw input passage--------

text = "Who was Jim Henson ? Jim Henson was a puppeteer. He is simply awful."
# tokenized_text = tokenizer.tokenize(text)

#----------Tokenization and Padding---------
# Encode the sentences to get tokenized and add padding stuff
encoded = tokenizer.encode_plus(
    text=text,  # the sentences to be encoded
    add_special_tokens=True,  # Add [CLS] and [SEP] !!!
    max_length = 64,  # maximum length of a sentence  (TODO Figure the longest passage length)
    pad_to_max_length=True,  # Add [PAD]s
    return_attention_mask = True,  # Generate the attention mask
    truncation=True,  #explicitly truncate examples to max length
    return_tensors = 'pt',  # ask the function to return PyTorch tensors
)

#-------------------------------------------
# view the IDs
for key, value in encoded.items():
    print(f"{key}: {value.numpy().tolist()}")
    
#-------------------------------------------


classifier(encoded)

1

1 Answers

0
votes

The pipeline already includes the encoder. Instead of

classifier(encoded)

do

classifier(text)