I'm trying to train a model to classify whether an answer actually answers the question it is paired with, using this dataset.
I train in batches of 1000 (except the last, smaller one) and use GloVe word embeddings. The approach is to feed the first sentence (the question) and then the second sentence (the answer) to an LSTM, and have it produce a number between 0 and 1 via a sigmoid.
The problem is that the loss just repeats itself after epoch 1; it never converges to the correct result, which should be 1 if the answer belongs to the question and 0 otherwise.
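For context, the inputs are built sequence-first, i.e. each batch is a tensor of shape (sentence_len, batch_size, 300). A minimal sketch of what I mean (the glove lookup table and the pre-padded tokens list are placeholders, not my actual loading code):

# Hypothetical illustration of the input layout; `glove` maps token -> 300-d vector,
# and `tokens` is a list of batch_size sentences already padded to the same length.
batch = torch.stack([
    torch.stack([torch.as_tensor(glove[tok], dtype=torch.float32) for tok in sent])
    for sent in tokens
])                                          # (batch_size, sentence_len, 300)
batch = batch.permute(1, 0, 2).to(device)   # (sentence_len, batch_size, 300), what the LSTM expects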
My code is as below:
import math
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

class QandA(nn.Module):
    def __init__(self, input_size, hidden_size):
        super(QandA, self).__init__()
        self.hidden_size = hidden_size
        self.num_layers = 1
        self.bidirectional = True
        self.lstm = nn.LSTM(input_size, self.hidden_size,
                            num_layers=self.num_layers, bidirectional=self.bidirectional)
        self.lstm.to(device)
        self.hidden2class = nn.Linear(self.hidden_size * 2, 1)
        self.hidden2class.to(device)

    def forward(self, glove_vec, glove_vec2):
        # glove_vec.shape = (sentence_len, batch_size, 300)
        # run the question through the LSTM, then the answer, reusing the hidden state
        output, hidden = self.lstm(glove_vec)
        output, _ = self.lstm(glove_vec2, hidden)
        # output.shape = (sentence_len, batch_size, hidden_size * 2)
        # classify from the last time step of the answer
        output = self.hidden2class(output[-1, :, :])
        # output.shape = (batch_size, 1)
        return F.sigmoid(output)

model = QandA(300, 60).to(device)
loss_function = nn.BCELoss()
optimizer = optim.Adam(model.parameters(), lr=0.1)
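As a sanity check, running the model on random tensors with the expected layout gives an output of shape (batch_size, 1); the sequence lengths and batch size below are made up purely for illustration:

q = torch.randn(12, 4, 300, device=device)  # dummy question batch: (question_len, batch_size, 300)
a = torch.randn(20, 4, 300, device=device)  # dummy answer batch:   (answer_len, batch_size, 300)
print(model(q, a).shape)                     # torch.Size([4, 1])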
Is my approach so wrong that it can't work in practice, or is there some other problem that I'm overlooking?
edit: Extra code regarding the training:
batch_size = 1000

# load_dataset loads the data from the file.
questions, answers, outputs = load_dataset()
N = len(outputs)

losses = []
for epoch in range(10):
    for batch in range(math.ceil(N / batch_size)):
        model.zero_grad()
        # get_data fetches this batch from the dataset (batch_size samples, sequence-first)
        input1, input2, targets = get_data(batch, batch_size)
        class_pred = model(input1, input2)
        loss = loss_function(class_pred, targets)
        loss.backward()
        optimizer.step()
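The losses list above is meant to record the per-batch loss so I can see whether it moves at all; a minimal sketch of how that would slot into the loop (the print format is just illustrative):

        # inside the batch loop, right after optimizer.step()
        losses.append(loss.item())
    # at the end of each epoch
    print(f"epoch {epoch}: mean batch loss = {sum(losses) / len(losses):.4f}")
    losses.clear()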
Try lr=0.001 or even lr=0.0001 and try it again, as Adam usually requires smaller learning rates than vanilla SGD. – Jan
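Following that comment, the only change would be the optimizer line; the exact value is just the one the comment suggests:

optimizer = optim.Adam(model.parameters(), lr=0.001)  # or lr=0.0001, per the comment above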