
I'm trying to train a model to classify whether an answer actually answers a given question, using this dataset.

I'm training in batches and using GloVe word embeddings. The batches have 1000 samples each, except the last one. The method I'm trying is to feed the first sentence (the question) to an LSTM, then feed the second sentence (the answer) to the same LSTM, and have it give me a number between 0 and 1 via a sigmoid function.

The problem is that the loss just repeats itself after epoch 1. It never converges to the correct result, which should be 1 if the answer belongs to the question and 0 otherwise.

My code is as follows:

import math

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

# use the GPU if available (device was not shown in the original snippet)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

class QandA(nn.Module):
    def __init__(self, input_size, hidden_size):
        super(QandA, self).__init__()

        self.hidden_size = hidden_size
        self.num_layers = 1
        self.bidirectional = True

        self.lstm = nn.LSTM(input_size, self.hidden_size, num_layers = self.num_layers, bidirectional = self.bidirectional)
        self.lstm.to(device)
        self.hidden2class = nn.Linear(self.hidden_size * 2, 1)
        self.hidden2class.to(device)

    def forward(self, glove_vec, glove_vec2):
        # glove_vec.shape = (sentence_len, batch_size, 300)
        output, hidden = self.lstm(glove_vec)
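        # run the answer through the same LSTM, initialized with the question's final (hidden, cell) state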
        output, _ = self.lstm(glove_vec2, hidden)
        # output.shape = (sentence_len, batch_size, hidden_size * 2)
        output = self.hidden2class(output[-1,:,:])
        # output.shape = (batch_size, 1)
        return F.sigmoid(output)

model = QandA(300, 60).to(device)
loss_function = nn.BCELoss()
optimizer = optim.Adam(model.parameters(), lr=0.1)

Is my approach so wrong that it can't work in practice, or is there some other problem that I'm overlooking?

Edit: extra code regarding the training:

batch_size = 1000
# load_dataset loads the data from the file.
questions, answers, outputs = load_dataset()
N = len(outputs)
losses = []
for epoch in range(10):
    for batch in range(math.ceil(N / batch_size)):
        model.zero_grad()

        # get_data returns the batch-th batch (of size batch_size) of question/answer sequences and targets
        input1, input2, targets = get_data(batch, batch_size)

        class_pred = model(input1, input2)
        loss = loss_function(class_pred, targets)
        loss.backward()
        optimizer.step()
I'm not an expert in NLP but from the code you have given, my first suggestion would be to lower the learning rate to maybe lr=0.001 or even lr=0.0001 and try it again, as Adam usually requires smaller learning rates than vanilla SGD. – Jan
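In code, that change against the optimizer line from the question would look like this (0.001 is just the value suggested in the comment):

optimizer = optim.Adam(model.parameters(), lr=0.001)  # instead of lr=0.1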
@Jan I tried changing the lr as you suggested (here are the results); I don't think it changed anything, but the loss looks more organic now, which I guess is a good thing. – Fethbita
How many training samples do you have in total? Have you tried much smaller batches (i.e. 16/32/64)? How about the distribution of samples: is there an even distribution between correct answers (1) and incorrect answers (0)? Do you have training/validation/test splitting? Are they distributed evenly? – dennlinger
@dennlinger There are around 20k samples. I have not tried smaller batches; I will do that now. The distribution is q1-a1-0, q1-a2-0, q1-a3-1, q2-a1-0, and so on. The question repeats until the correct answer is given in the set (sometimes there is more than one correct answer as well), so the correct answers are far fewer than the incorrect ones. I do have train/dev/test splitting, but I've not used the dev or test samples yet because it looked like there was a problem in the training. The distribution is similar between splits. – Fethbita
What you could then try to do is reduce the number of negative samples you are using, i.e. only use part of the incorrect answers, to create an artificial balance between the classes. First make sure that you are in fact getting bad results due to this imbalance, though; one way to see this would be a confusion matrix. – dennlinger
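A rough sketch of both suggestions, assuming questions, answers, outputs come from load_dataset() as in the question; downsample_negatives is a hypothetical helper name, and val_targets/val_preds stand for labels and sigmoid predictions on a held-out split:

import numpy as np
from sklearn.metrics import confusion_matrix

def downsample_negatives(questions, answers, outputs, ratio=1.0):
    # hypothetical helper: keep every positive sample and a random subset of
    # negatives so that #negatives is roughly ratio * #positives
    outputs = np.asarray(outputs)
    pos_idx = np.where(outputs == 1)[0]
    neg_idx = np.where(outputs == 0)[0]
    keep_neg = np.random.choice(neg_idx, size=int(ratio * len(pos_idx)), replace=False)
    keep = np.concatenate([pos_idx, keep_neg])
    np.random.shuffle(keep)
    return [questions[i] for i in keep], [answers[i] for i in keep], outputs[keep]

# confusion matrix on a validation split, thresholding the sigmoid outputs at 0.5;
# cm[0, 0] are true negatives, cm[1, 1] are true positives
cm = confusion_matrix(val_targets, (val_preds > 0.5).astype(int))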

1 Answer


I would suggest encoding the question and answer independently and putting a classifier on top. For example, you can encode the question and the answer each with a biLSTM, concatenate their representations, and feed that to the classifier. The code could be something like this (not tested, but I hope you get the idea):

class QandA(nn.Module):
    def __init__(self, input_size, hidden_size):
        super(QandA, self).__init__()

        self.hidden_size = hidden_size
        self.num_layers = 1
        self.bidirectional = True

        self.lstm_question = nn.LSTM(input_size, self.hidden_size, num_layers = self.num_layers, bidirectional = self.bidirectional)
        self.lstm_question.to(device)
        self.lstm_answer = nn.LSTM(input_size, self.hidden_size, num_layers = self.num_layers, bidirectional = self.bidirectional)
        self.lstm_answer.to(device)
        self.fc = nn.Linear(self.hidden_size * 4, 1)
        self.fc.to(device)

    def forward(self, glove_question, glove_answer):
        # glove.shape = (sentence_len, batch_size, 300)
        question_last_hidden, _ = self.lstm_question(glove_question)
        # question_last_hidden.shape = (question_len, batch_size, hidden_size * 2)
        answer_last_hidden, _ = self.lstm_answer(glove_answer)
        # answer_last_hidden.shape = (answer_len, batch_size, hidden_size * 2)

        # take the output at the last time step; if you stack multiple LSTM layers, take only the last layer's forward/backward hidden states
        question_last_hidden = question_last_hidden[-1,:,:]
        answer_last_hidden = answer_last_hidden[-1,:,:]
        representation = torch.cat([question_last_hidden, answer_last_hidden], -1) # concatenate along the last (feature) dimension
        # representation.shape = (batch_size, hidden_size * 4)
        output = self.fc(representation)
        # output.shape = (batch_size, 1)
        return F.sigmoid(output)
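For completeness, here is a sketch of how this model could be trained, reusing load_dataset/get_data from the question; the smaller learning rate and batch size follow the comment thread and are assumptions, not tested values:

model = QandA(300, 60).to(device)
loss_function = nn.BCELoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)  # much smaller than the 0.1 in the question
batch_size = 32  # smaller batches, as suggested in the comments

questions, answers, outputs = load_dataset()
N = len(outputs)
for epoch in range(10):
    for batch in range(math.ceil(N / batch_size)):
        optimizer.zero_grad()
        glove_question, glove_answer, targets = get_data(batch, batch_size)
        class_pred = model(glove_question, glove_answer)
        # targets may need reshaping to (batch_size, 1) to match class_pred
        loss = loss_function(class_pred, targets)
        loss.backward()
        optimizer.step()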