I am comparing KFold and RepeatedKFold using sklearn version 0.22. According to the documentation, RepeatedKFold "Repeats K-Fold n times with different randomization in each repetition." One would expect the results from running RepeatedKFold with only one repeat (n_repeats = 1) to be pretty much identical to those from KFold.
I ran a simple comparison:
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.datasets import load_digits
from sklearn.model_selection import StratifiedKFold, KFold, RepeatedKFold, RepeatedStratifiedKFold
from sklearn import metrics
X, y = load_digits(return_X_y=True)
classifier = SGDClassifier(loss='hinge', penalty='elasticnet', fit_intercept=True)
scorer = metrics.accuracy_score
results = []
n_splits = 5
kf = KFold(n_splits=n_splits)
for train_index, test_index in kf.split(X, y):
    x_train, y_train = X[train_index], y[train_index]
    x_test, y_test = X[test_index], y[test_index]
    classifier.fit(x_train, y_train)
    results.append(scorer(y_test, classifier.predict(x_test)))
print ('KFold')
print('mean = ', np.mean(results))
print('std = ', np.std(results))
print()
results = []
n_repeats = 1
rkf = RepeatedKFold(n_splits=n_splits, n_repeats = n_repeats)
for train_index, test_index in rkf.split(X, y):
    x_train, y_train = X[train_index], y[train_index]
    x_test, y_test = X[test_index], y[test_index]
    classifier.fit(x_train, y_train)
    results.append(scorer(y_test, classifier.predict(x_test)))
print ('RepeatedKFold')
print('mean = ', np.mean(results))
print('std = ', np.std(results))
The output is
KFold
mean = 0.9082079851439182
std = 0.04697225962068869
RepeatedKFold
mean = 0.9493562364593006
std = 0.017732595698953055
I repeated this experiment enough times to see that the difference is statistically significant.
I have read and reread the documentation to see whether I'm missing something, but to no avail.
Btw, the same holds true for StratifiedKFold and RepeatedStratifiedKFold:
StratifiedKFold
mean = 0.9159935004642525
std = 0.026687786392525545
RepeatedStratifiedKFold
mean = 0.9560476632621479
std = 0.014405630805910506
For this data set, StratifiedKFold agrees with KFold; RepeatedStratifiedKFold agrees with RepeatedKFold.
UPDATE Following the suggestions from @Dan and @SergeyBushmanov, I included shuffle and random_state:
def run_nfold(X, y, classifier, scorer, cv, n_repeats):
    # Run the full cross-validation n_repeats times and collect every fold score
    results = []
    for n in range(n_repeats):
        for train_index, test_index in cv.split(X, y):
            x_train, y_train = X[train_index], y[train_index]
            x_test, y_test = X[test_index], y[test_index]
            classifier.fit(x_train, y_train)
            results.append(scorer(y_test, classifier.predict(x_test)))
    return results
kf = KFold(n_splits=n_splits)
results_kf = run_nfold(X,y, classifier, scorer, kf, 10)
print('KFold mean = ', np.mean(results_kf))
kf_shuffle = KFold(n_splits=n_splits, shuffle=True, random_state = 11)
results_kf_shuffle = run_nfold(X,y, classifier, scorer, kf_shuffle, 10)
print('KFold Shuffled mean = ', np.mean(results_kf_shuffle))
rkf = RepeatedKFold(n_splits=n_splits, n_repeats = n_repeats, random_state = 111)
results_kf_repeated = run_nfold(X,y, classifier, scorer, rkf, 10)
print('RepeatedKFold mean = ', np.mean(results_kf_repeated))
produces
KFold mean = 0.9119255648406066
KFold Shuffled mean = 0.9505304859176724
RepeatedKFold mean = 0.950754100897555
Moreover, using the Kolmogorov-Smirnov test:
from scipy.stats import ks_2samp
print('Compare KFold with KFold shuffled results')
print(ks_2samp(results_kf, results_kf_shuffle))
print('Compare RepeatedKFold with KFold shuffled results')
print(ks_2samp(results_kf_repeated, results_kf_shuffle))
shows that shuffled KFold and RepeatedKFold (which, it seems, is shuffled by default; you are right, @Dan) are statistically indistinguishable, whereas the default non-shuffled KFold produces a statistically significantly lower result:
Compare KFold with KFold shuffled results
Ks_2sampResult(statistic=0.66, pvalue=1.3182765881237494e-10)
Compare RepeatedKFold with KFold shuffled results
Ks_2sampResult(statistic=0.14, pvalue=0.7166468440414822)
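To make the reading of these two numbers concrete, here is a minimal sketch of how ks_2samp behaves on synthetic scores. The data below are made up for illustration only (they are not my experiment results), and the names base and shifted are hypothetical: identical samples give a statistic of 0, while a clearly shifted sample gives a large statistic and a tiny p-value.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.RandomState(0)

# Synthetic accuracy-like scores, for illustration only (not the real results)
base = rng.normal(loc=0.95, scale=0.02, size=50)
shifted = base - 0.04  # same spread, lower mean

same = ks_2samp(base, base)     # a sample compared with itself
diff = ks_2samp(base, shifted)  # a sample compared with a shifted copy

print('identical samples: statistic =', same.statistic)
print('shifted samples:   statistic =', diff.statistic, 'pvalue =', diff.pvalue)
```

A large p-value only means the test found no evidence of a difference; it does not prove that the two distributions are the same.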
Now, note that I used different random_state values for KFold and RepeatedKFold. So the answer, or rather a partial answer, is that the difference in results is due to shuffling versus not shuffling. This makes sense: a different random_state can change the exact split, but it shouldn't change statistical properties such as the mean over multiple runs.
I'm now confused about why shuffling causes this effect. I've changed the title of the question to reflect this confusion (I hope it doesn't break any Stack Overflow rules, but I don't want to create another question).
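The difference in shuffling can also be seen directly on the splits. Below is a minimal sketch on a toy array of 10 samples. The toy data and the helper name fold_test_indices are my own and not part of the experiment above. It also checks the claim from the comments that RepeatedKFold with n_repeats=1 produces the same splits as a shuffled KFold when both get the same integer random_state; treat that as an assumption about the implementation rather than documented behaviour.

```python
import numpy as np
from sklearn.model_selection import KFold, RepeatedKFold

X_toy = np.arange(10).reshape(-1, 1)  # toy data, 10 samples (illustration only)

def fold_test_indices(cv, X):
    """Return the test indices of every fold as plain lists."""
    return [test_idx.tolist() for _, test_idx in cv.split(X)]

plain = fold_test_indices(KFold(n_splits=5), X_toy)
shuffled = fold_test_indices(
    KFold(n_splits=5, shuffle=True, random_state=0), X_toy)
repeated = fold_test_indices(
    RepeatedKFold(n_splits=5, n_repeats=1, random_state=0), X_toy)

print('KFold (default): ', plain)     # consecutive blocks of indices
print('KFold (shuffled):', shuffled)
print('RepeatedKFold:   ', repeated)
print('shuffled == repeated:', shuffled == repeated)
```

With the default KFold, every test fold is a consecutive block of rows, so any ordering in the data set carries over into the folds; the shuffled variants draw each fold from across the whole data set.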
UPDATE I agree with @SergeyBushmanov's suggestion. I posted it as a new question
random_state seed? – Dan
shuffle=True for KFold. In your repeated experiments, did you get the same result for KFold each time and a different result for RepeatedKFold? – Dan
RepeatedKFold uses KFold underneath to generate folds. See the link to the code below in my answer. They are producing the same splits as long as random_seed is the same. – Sergey Bushmanov