
I currently have the task of converting a Keras BERT-based model for an arbitrary text classification problem to a .pb file. I already have a function that takes in the Keras model, but the problem is that whenever I download a pre-trained version of BERT it comes without any top layers for classification, so I have to manually add tf.keras.layers.Input layers in front of it and some neural network architecture on top of BERT (after the [CLS] embedding). My goal is ultimately to escape the need for fine-tuning and get a ready-made model that has already been fine-tuned. I've found that the transformers library might be useful for this, as it provides BERT-based models already fine-tuned on some datasets. However, using the following code from their documentation gives back a tensor of shape 1 by number of tokens by hidden dimensionality.

from transformers import BertTokenizer, TFBertModel
tokenizer = BertTokenizer.from_pretrained('bert-large-uncased')
model = TFBertModel.from_pretrained("bert-large-uncased")
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='tf')
output = model(encoded_input)
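
For reference, the shape can be checked directly (bert-large-uncased has a hidden size of 1024):

print(output[0].shape)  # (1, number_of_tokens, 1024) - one vector per input token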

So, it seems I would eventually have to find some dataset and do the fine-tuning myself. Even models like distilbert-base-uncased-finetuned-sst-2-english still produce an embedding for each input token. Is there a way to get a ready-made model?


1 Answer


You have to do some sort of additional work with BERT for text classification, otherwise it won't know which labels you want it to classify into.

The simplest way, if you want to avoid fine-tuning BERT itself, is to do something like:

from tensorflow.keras.layers import Input, Flatten, Dense
from tensorflow.keras.models import Model
from transformers import TFBertModel

bert = TFBertModel.from_pretrained("bert-large-uncased")
bert.trainable = False  # freeze BERT so only the new classification head gets trained

# One Input layer per array returned by the tokenizer (token ids must be integers)
input_ids_layer = Input(shape=(<max sequence length here>,), dtype='int32')
attn_mask_layer = Input(shape=(<max sequence length here>,), dtype='int32')
segment_ids_layer = Input(shape=(<max sequence length here>,), dtype='int32')

layer = bert([input_ids_layer, attn_mask_layer, segment_ids_layer])[0]  # last hidden state
layer = Flatten()(layer)
layer = Dense(<num dense nodes here>, activation='relu')(layer)
output = Dense(<num classes here>, activation=<sigmoid/softmax>)(layer)

model = Model(inputs=[input_ids_layer, attn_mask_layer, segment_ids_layer], outputs=output)

Above we are using the BERT model but freezing all of its weights by setting bert.trainable = False, and then adding three Input layers, one for each of the arrays that get returned from using the tokenizer (input ids, attention mask and segment/token type ids).
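
For reference, a rough sketch of what those tokenizer outputs look like (assuming a max sequence length of 128 to match the Input shapes above, and padding everything to that length):

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-large-uncased")
enc = tokenizer(["an example sentence"], padding="max_length", truncation=True,
                max_length=128, return_tensors="np")
print(enc.keys())              # dict_keys(['input_ids', 'token_type_ids', 'attention_mask'])
print(enc["input_ids"].shape)  # (1, 128)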

Next we add a Flatten layer to reshape the output of the BERT layer so it can be passed to a Dense layer with a custom number of nodes - this is what you will need to train. Finally we have another Dense layer with as many nodes as there are classes, and then combine the entire pipeline into a Keras Model to give you access to the fit/predict methods (etc.).
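
Putting it together, a minimal toy training sketch (assuming binary labels with a single sigmoid output node and a max sequence length of 128; only the Dense layers above actually get trained):

import numpy as np

texts = ["great movie", "terrible movie"]  # toy data, purely for illustration
labels = np.array([1, 0])                  # binary labels for a single sigmoid output

enc = tokenizer(texts, padding="max_length", truncation=True,
                max_length=128, return_tensors="np")

inputs = [enc["input_ids"].astype("int32"),
          enc["attention_mask"].astype("int32"),
          enc["token_type_ids"].astype("int32")]

model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(inputs, labels, epochs=1, batch_size=2)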