7
votes

I am new to pytorch and are trying to implement a feed forward neural network to classify the mnist data set. I have some problems when trying to use cross-validation. My data has the following shapes: x_train: torch.Size([45000, 784]) and y_train: torch.Size([45000])

I tried to use KFold from sklearn.

kfold =KFold(n_splits=10)

Here is the first part of my train method where I'm dividing the data into folds:

for  train_index, test_index in kfold.split(x_train, y_train): 
        x_train_fold = x_train[train_index]
        x_test_fold = x_test[test_index]
        y_train_fold = y_train[train_index]
        y_test_fold = y_test[test_index]
        print(x_train_fold.shape)
        for epoch in range(epochs):
         ...

The indices for the y_train_fold variable is right, it's simply: [ 0 1 2 ... 4497 4498 4499], but it's not for x_train_fold, which is [ 4500 4501 4502 ... 44997 44998 44999]. And the same goes for the test folds.

For the first iteration I want the varibale x_train_fold to be the first 4500 pictures, in other words to have the shape torch.Size([4500, 784]), but it has the shape torch.Size([40500, 784])

Any tips on how to get this right?

4

4 Answers

8
votes

I think you're confused!

Ignore the second dimension for a while, When you've 45000 points, and you use 10 fold cross-validation, what's the size of each fold? 45000/10 i.e. 4500.

It means that each of your fold will contain 4500 data points, and one of those fold will be used for testing, and the remaining for training i.e.

For testing: one fold => 4500 data points => size: 4500
For training: remaining folds => 45000-4500 data points => size: 45000-4500=40500

Thus, for first iteration, the first 4500 data points (corresponding to indices) will be used for testing and the rest for training. (Check below image)

Given your data is x_train: torch.Size([45000, 784]) and y_train: torch.Size([45000]), this is how your code should look like:

for train_index, test_index in kfold.split(x_train, y_train):  
    print(train_index, test_index)

    x_train_fold = x_train[train_index] 
    y_train_fold = y_train[train_index] 
    x_test_fold = x_train[test_index] 
    y_test_fold = y_train[test_index] 

    print(x_train_fold.shape, y_train_fold.shape) 
    print(x_test_fold.shape, y_test_fold.shape) 
    break 

[ 4500  4501  4502 ... 44997 44998 44999] [   0    1    2 ... 4497 4498 4499]
torch.Size([40500, 784]) torch.Size([40500])
torch.Size([4500, 784]) torch.Size([4500])

So, when you say

I want the variable x_train_fold to be the first 4500 picture... shape torch.Size([4500, 784]).

you're wrong. this size corresonds to x_test_fold. In the first iteration, based on 10 folds, x_train_fold will have 40500 points, thus its size is supposed to be torch.Size([40500, 784]).

K-fold validation image

8
votes

Think I have it right now, but I feel the code is a bit messy, with 3 nested loops. Is there any simpler way to it or is this approach okay?

Here's my code for the training with cross validation:

def train(network, epochs, save_Model = False):
    total_acc = 0
    for fold, (train_index, test_index) in enumerate(kfold.split(x_train, y_train)):
        ### Dividing data into folds
        x_train_fold = x_train[train_index]
        x_test_fold = x_train[test_index]
        y_train_fold = y_train[train_index]
        y_test_fold = y_train[test_index]

        train = torch.utils.data.TensorDataset(x_train_fold, y_train_fold)
        test = torch.utils.data.TensorDataset(x_test_fold, y_test_fold)
        train_loader = torch.utils.data.DataLoader(train, batch_size = batch_size, shuffle = False)
        test_loader = torch.utils.data.DataLoader(test, batch_size = batch_size, shuffle = False)

        for epoch in range(epochs):
            print('\nEpoch {} / {} \nFold number {} / {}'.format(epoch + 1, epochs, fold + 1 , kfold.get_n_splits()))
            correct = 0
            network.train()
            for batch_index, (x_batch, y_batch) in enumerate(train_loader):
                optimizer.zero_grad()
                out = network(x_batch)
                loss = loss_f(out, y_batch)
                loss.backward()
                optimizer.step()
                pred = torch.max(out.data, dim=1)[1]
                correct += (pred == y_batch).sum()
                if (batch_index + 1) % 32 == 0:
                    print('[{}/{} ({:.0f}%)]\tLoss: {:.6f}\t Accuracy:{:.3f}%'.format(
                        (batch_index + 1)*len(x_batch), len(train_loader.dataset),
                        100.*batch_index / len(train_loader), loss.data, float(correct*100) / float(batch_size*(batch_index+1))))
        total_acc += float(correct*100) / float(batch_size*(batch_index+1))
    total_acc = (total_acc / kfold.get_n_splits())
    print('\n\nTotal accuracy cross validation: {:.3f}%'.format(total_acc))
3
votes

You messed with indices.

x_train = x[train_index]
x_test = x[test_index]
y_train = y[train_index]
y_test = y[test_index]
    x_fold = x_train[train_index]
    y_fold = y_train[test_index]

It should be:

x_fold = x_train[train_index]
y_fold = y_train[train_index]
0
votes

Though all the above answers provide a good example of how to split the dataset, I am curious about the way to implement the K-fold cross-validation. K-fold aims to estimate the skill of a machine learning model on unseen data. To use a limited sample to estimate how the model is expected to perform in general when used to make predictions on data not used during the training of the model. (See the concept and explanation in Wikipedia https://en.wikipedia.org/wiki/Cross-validation_(statistics)) Therefore, it is necessary to initialize the parameters of your to-be-trained model at the beginning of each fold. Otherwise, your model will see every sample in the dataset after K-fold and there is no such thing as validation (all are training samples).