I'm attempting to fine-tune the HuggingFace TFBertModel to classify some text with a single label. I have the model up and running, but the accuracy is extremely low from the start. My expectation was that the accuracy would be high, given that it uses the pre-trained BERT weights as a starting point. I was hoping to get some advice on where I'm going wrong.
I'm using the bbc-text dataset from here:
Load Data
import pandas as pd

df = pd.read_csv(open(<s3 url>), encoding='utf-8', error_bad_lines=False)
df = df.sample(frac=1)     # shuffle the rows
df = df.dropna(how='any')
Value Counts
sport 511
business 510
politics 417
tech 401
entertainment 386
Name: label, dtype: int64
Preprocessing
import re
from html import unescape

def preprocess_text(sen):
    # Convert HTML entities to normal characters
    sentence = unescape(sen)
    # Remove HTML tags
    sentence = remove_tags(sentence)
    # Remove newline characters
    sentence = remove_newlinechars(sentence)
    # Remove punctuation and numbers
    sentence = re.sub('[^a-zA-Z]', ' ', sentence)
    # Convert to lowercase
    sentence = sentence.lower()
    return sentence

def remove_newlinechars(text):
    return " ".join(text.splitlines())

def remove_tags(text):
    TAG_RE = re.compile(r'<[^>]+>')
    return TAG_RE.sub('', text)
df['text_prepd'] = df['text'].apply(preprocess_text)
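Just to illustrate what the cleaning does, a quick check on a made-up string (not from the dataset):

sample = "<p>Profits &amp; losses\nrose 5%</p>"
print(preprocess_text(sample))
# tags, entities and newlines are removed, digits and punctuation become spaces, text is lowercased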
Split Data
from sklearn.model_selection import train_test_split

train, val = train_test_split(df, test_size=0.30, shuffle=True, stratify=df['label'])
Encode Labels
import numpy as np
from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()
y_train = np.asarray(label_encoder.fit_transform(train['label']))
y_val = np.asarray(label_encoder.fit_transform(val['label']))
Define BERT input function
import tensorflow as tf
from tqdm import tqdm
from transformers import BertTokenizer

# Initialise BERT tokenizer
bert_tokenizer_transformer = BertTokenizer.from_pretrained('bert-base-cased')

def create_input_array(df, tokenizer, args):
    sentences = df.text_prepd.values
    input_ids = []
    attention_masks = []
    token_type_ids = []

    for sent in tqdm(sentences):
        # `encode_plus` will:
        #   (1) Tokenize the sentence.
        #   (2) Prepend the `[CLS]` token to the start.
        #   (3) Append the `[SEP]` token to the end.
        #   (4) Map tokens to their IDs.
        #   (5) Pad or truncate the sentence to `max_length`.
        #   (6) Create attention masks for [PAD] tokens.
        encoded_dict = tokenizer.encode_plus(
            sent,                          # Sentence to encode.
            add_special_tokens=True,       # Add '[CLS]' and '[SEP]'.
            max_length=args.max_seq_len,   # Pad & truncate all sentences.
            pad_to_max_length=True,
            return_attention_mask=True,    # Construct attention masks.
            return_tensors='tf',           # Return tf tensors.
        )
        # Add the encoded sentence to the list.
        input_ids.append(encoded_dict['input_ids'])
        # And its attention mask (simply differentiates padding from non-padding).
        attention_masks.append(encoded_dict['attention_mask'])
        token_type_ids.append(encoded_dict['token_type_ids'])

    input_ids = tf.convert_to_tensor(input_ids)
    attention_masks = tf.convert_to_tensor(attention_masks)
    token_type_ids = tf.convert_to_tensor(token_type_ids)

    return input_ids, attention_masks, token_type_ids
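To double-check what encode_plus gives back, a quick inspection along these lines (sentence made up) shows one (1, max_seq_len) tensor per key, which is why the reshape further down is needed:

check = bert_tokenizer_transformer.encode_plus(
    "some example text",
    add_special_tokens=True,
    max_length=args.max_seq_len,
    pad_to_max_length=True,
    return_attention_mask=True,
    return_tensors='tf',
)
print({k: v.shape for k, v in check.items()})
# each value has shape (1, args.max_seq_len)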
Convert Data to Bert Inputs
train_inputs = [create_input_array(train[:], tokenizer=bert_tokenizer_transformer, args=args)]
val_inputs = [create_input_array(val[:], tokenizer=bert_tokenizer_transformer, args=args)]
To train_inputs, y_train and val_inputs, y_val I then apply the function below, which reshapes the tensors and converts them to numpy arrays. The list returned by this function is then passed as arguments to the Keras fit method (roughly as sketched after the function). I realise converting to tf.Tensors and then to numpy is a bit of overkill, but I don't think it has an impact. I was originally trying to use tf.data datasets but switched to numpy.
def convert_inputs_to_tf_dataset(inputs, y, args):
    # args.max_seq_len = 256
    ids = inputs[0][1]
    masks = inputs[0][1]
    token_types = inputs[0][2]

    ids = tf.reshape(ids, (-1, args.max_seq_len))
    print("Input ids shape: ", ids.shape)
    masks = tf.reshape(masks, (-1, args.max_seq_len))
    print("Input Masks shape: ", masks.shape)
    token_types = tf.reshape(token_types, (-1, args.max_seq_len))
    print("Token type ids shape: ", token_types.shape)

    ids = ids.numpy()
    masks = masks.numpy()
    token_types = token_types.numpy()

    return [ids, masks, token_types, y]
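For context, this is roughly how I wire the returned list into fit (a simplified sketch; the variable names here are illustrative and the actual call is in the Compile and Fit section below with its arguments elided):

train_features = convert_inputs_to_tf_dataset(train_inputs, y_train, args)
val_features = convert_inputs_to_tf_dataset(val_inputs, y_val, args)

x_train = train_features[:3]   # [ids, masks, token_types]
x_val = val_features[:3]
# these then feed model_.fit(x_train, y_train, validation_data=(x_val, y_val), ...)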
Keras Model
# args.max_seq_len = 256
# n_classes = 6
from tensorflow.keras.layers import Input, Flatten, Dropout, Dense
from tensorflow.keras.models import Model
from transformers import TFBertForSequenceClassification

model = TFBertForSequenceClassification.from_pretrained('bert-base-uncased', trainable=True, num_labels=n_classes)

input_ids_layer = Input(shape=(args.max_seq_len,), dtype=np.int32)
input_mask_layer = Input(shape=(args.max_seq_len,), dtype=np.int32)
input_token_type_layer = Input(shape=(args.max_seq_len,), dtype=np.int32)

bert_layer = model([input_ids_layer, input_mask_layer, input_token_type_layer])[0]
flat_layer = Flatten()(bert_layer)
dropout = Dropout(0.3)(flat_layer)
dense_output = Dense(n_classes, activation='softmax')(dropout)

model_ = Model(inputs=[input_ids_layer, input_mask_layer, input_token_type_layer], outputs=dense_output)
Compile and Fit
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
metric = tf.keras.metrics.SparseCategoricalAccuracy('accuracy')

model_.compile(optimizer='adam', loss=loss, metrics=[metric])
model_.fit(x=..., y=..., validation_data=..., epochs=50, batch_size=32, verbose=1)
Epoch 32/50
1401/1401 [==============================] - 42s 30ms/sample - loss: 1.6103 - accuracy: 0.2327 - val_loss: 1.6042 - val_accuracy: 0.2308
As I'm using BERT, only a few epochs are necessary, so I was expecting something much higher than 23% after 32 epochs.
Comments:
- In convert_inputs_to_tf_dataset you have ids = inputs[0][1], but the ids are the first element, i.e. it should be ids = inputs[0][0]. Further, you should fit the LabelEncoder only one time, i.e. just use transform (and not fit_transform) on the validation labels.
- Try an explicit, lower learning rate, e.g. tf.keras.optimizers.Adam(learning_rate=1e-4) (or even lower than that, say 1e-5).
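Putting those suggestions together, the corrected lines would look something along these lines (a sketch based on the comments above, not verified end-to-end):

# inside convert_inputs_to_tf_dataset: the ids are the first element
ids = inputs[0][0]
masks = inputs[0][1]
token_types = inputs[0][2]

# fit the encoder on the training labels only, then reuse it for validation
label_encoder = LabelEncoder()
y_train = np.asarray(label_encoder.fit_transform(train['label']))
y_val = np.asarray(label_encoder.transform(val['label']))

# use an explicit, lower learning rate instead of the default 'adam'
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-5)   # or 1e-4
model_.compile(optimizer=optimizer, loss=loss, metrics=[metric])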