I want to perform stratified 10-fold cross validation using sklearn. The train and test indices can be obtained using

from sklearn.model_selection import StratifiedKFold

kf = StratifiedKFold(n_splits=10)

for fold, (train_index, test_index) in enumerate(kf.split(X, y), 1):
    X_train = X[train_index]
    y_train = y[train_index]
    X_test = X[test_index]
    y_test = y[test_index]

However, I would like to set aside two folds rather than one, with the extra fold used for hyperparameter tuning. Each iteration should then use 8 folds for training, 1 for tuning, and 1 for testing. Is this possible with sklearn's StratifiedKFold, or would I need to write a custom split method?

1 Answer

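You could use StratifiedShuffleSplit to split each held-out fold once more, again in a stratified way. Note that this halves the test fold, so each iteration uses 9 folds for training and half a fold each for tuning and testing, rather than the exact 8/1/1 split described in the question: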

from sklearn.model_selection import StratifiedKFold, StratifiedShuffleSplit

kf = StratifiedKFold(n_splits=10)

for fold, (train_index, test_index) in enumerate(kf.split(X, y), 1):
    X_train = X[train_index]
    y_train = y[train_index]
    X_test = X[test_index]
    y_test = y[test_index]

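    # Split the held-out fold in half, stratified by class:
    # one half for testing, the other for hyperparameter tuning.
    sss = StratifiedShuffleSplit(n_splits=1, test_size=0.5, random_state=0)
    test_ix, tune_ix = next(sss.split(X_test, y_test))

    # These indices are relative to X_test / y_test, not to the full X / y.
    X_test_ = X_test[test_ix]
    y_test_ = y_test[test_ix]
    X_tune = X_test[tune_ix]
    y_tune = y_test[tune_ix]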
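
If an exact 8/1/1 split is needed, one possible approach is to reuse the 10 stratified folds that StratifiedKFold already produces and rotate through them: fold i for testing, the next fold for tuning, and the remaining 8 for training. The sketch below is only an illustration under some assumptions: the helper name train_tune_test_splits is hypothetical (it is not part of sklearn), X and y are assumed to be NumPy arrays, and the toy data is made up purely to make the example runnable.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold


def train_tune_test_splits(X, y, n_splits=10):
    """Yield (train_idx, tune_idx, test_idx) using n_splits-2 / 1 / 1 folds.

    Hypothetical helper, not an sklearn API.
    """
    skf = StratifiedKFold(n_splits=n_splits)
    # The test indices of the successive splits are the disjoint stratified folds.
    folds = [test_idx for _, test_idx in skf.split(X, y)]
    for i in range(n_splits):
        j = (i + 1) % n_splits  # the fold after the test fold is used for tuning
        train_idx = np.concatenate(
            [folds[k] for k in range(n_splits) if k not in (i, j)]
        )
        yield train_idx, folds[j], folds[i]


# Made-up toy data: 100 samples, two balanced classes.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = np.array([0] * 50 + [1] * 50)

splits = list(train_tune_test_splits(X, y))
for fold, (train_idx, tune_idx, test_idx) in enumerate(splits, 1):
    X_train, y_train = X[train_idx], y[train_idx]
    X_tune, y_tune = X[tune_idx], y[tune_idx]
    X_test, y_test = X[test_idx], y[test_idx]
```

Because every fold comes from the same stratified partition, each sample serves as test data exactly once and as tuning data exactly once across the 10 iterations. Here the indices refer to the full X and y, unlike the indices returned by the nested StratifiedShuffleSplit above.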