I'm using PyTorch and the base pretrained BERT model to classify sentences for hate speech. I want to implement a Bi-LSTM layer that takes as input all outputs of the last transformer encoder of the BERT model, as a new model (a class that implements nn.Module), and I got confused by the nn.LSTM parameters. I loaded the pretrained model with
bert = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=int(data['class'].nunique()),
    output_attentions=False,
    output_hidden_states=False,
)
My dataset has 2 columns: class (the label) and sentence. Can someone help me with this? Thank you in advance.
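To make the question concrete, here is a rough sketch of the wiring I have in mind. The class name and the lstm_hidden size are placeholders, and I'm guessing that I should switch to BertModel (which exposes the last encoder layer's per-token outputs as last_hidden_state in recent transformers versions) rather than BertForSequenceClassification:

import torch.nn as nn
from transformers import BertModel

class BertBiLSTM(nn.Module):
    def __init__(self, lstm_hidden=256):
        super().__init__()
        # BertModel (unlike BertForSequenceClassification) returns the
        # per-token outputs of the last encoder layer as last_hidden_state
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        # input_size must match BERT's hidden size (768 for bert-base);
        # batch_first=True matches BERT's (batch, seq_len, hidden) layout;
        # bidirectional=True is what makes it a Bi-LSTM
        self.lstm = nn.LSTM(input_size=self.bert.config.hidden_size,
                            hidden_size=lstm_hidden,
                            num_layers=1,
                            batch_first=True,
                            bidirectional=True)

    def forward(self, input_ids, attention_mask):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        sequence_output = outputs.last_hidden_state  # (batch, seq_len, 768)
        # lstm_out: (batch, seq_len, 2 * lstm_hidden), outputs at all time steps
        # h_n: (num_layers * 2, batch, lstm_hidden), the final hidden states
        lstm_out, (h_n, c_n) = self.lstm(sequence_output)
        return lstm_out, h_n

Is this the right way to map BERT's output onto the nn.LSTM parameters?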
Edit: Also, after processing the input in the Bi-LSTM, the network should send the final hidden state to a fully connected layer that performs classification using the softmax activation function. How can I do that?
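Concretely, I imagine the classification head looking something like the sketch below. I'm assuming a single-layer Bi-LSTM, so h_n[-2] and h_n[-1] are the final forward and backward hidden states; the head's name and sizes are placeholders:

import torch
import torch.nn as nn

class BiLSTMClassifierHead(nn.Module):
    def __init__(self, lstm_hidden, num_labels):
        super().__init__()
        # the forward and backward final states are concatenated,
        # so the fully connected layer sees 2 * lstm_hidden features
        self.fc = nn.Linear(2 * lstm_hidden, num_labels)

    def forward(self, h_n):
        # h_n: (num_layers * 2, batch, lstm_hidden); for a single-layer
        # Bi-LSTM, h_n[-2] is the final forward state and h_n[-1] the
        # final backward state
        final_hidden = torch.cat((h_n[-2], h_n[-1]), dim=1)
        logits = self.fc(final_hidden)
        return torch.softmax(logits, dim=1)  # class probabilities

Also, if I train this with nn.CrossEntropyLoss, should I return the raw logits instead, since that loss already applies log-softmax internally?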