
Is there a way to train a model on 8 of the 10 folds produced by sklearn's kf = KFold(n_splits=10)?

I want to split my data into three subsets: training, validation, and testing (I think this can be done by calling train_test_split twice).

The training set is used to fit the model, the validation set is used to tune the parameters, and the test set is used to assess the generalization error of the final model.

But I was wondering if there is a way to train on just 8 of the 10 folds and get an error/accuracy, validate on 1 fold, and finally test on the last fold, getting an error/accuracy for those too.

See below for my thinking:

from sklearn import tree
from sklearn.model_selection import KFold, train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=1)
clf = tree.DecisionTreeClassifier(criterion="entropy", max_depth=3)
kf = KFold(n_splits=10, shuffle=False)  # define number of splits; random_state only applies when shuffle=True
kf.get_n_splits(X)  # to check how many splits will be done
for train, test in kf.split(X_train, y_train):
    ...
What exactly do you mean by 'tune the parameters'? Do you mean the hyperparameters of your classifier? - pythonic833

2 Answers

0
votes

From your question, I understand that you want to leave out one or more of your subsets. In that case, you can use Leave One Out (LOO) or Leave P Out (LPO), available in sklearn as LeaveOneOut and LeavePOut. Note that these leave out individual samples (one or p per split) rather than whole folds.
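
One way to carry the leave-p-out idea over to folds, as a rough sketch: build the 10 folds with KFold, then run LeavePOut(p=2) over the fold numbers so that each split keeps 8 folds for training and holds out 2 (one treated as validation, one as test). The toy arrays X and y, and the names folds, fold_ids and splits, are assumptions made for illustration only; they are not from the question.

```python
import numpy as np
from sklearn.model_selection import KFold, LeavePOut

# Assumption: toy data with 20 samples, standing in for the real X and y.
X = np.arange(40).reshape(20, 2)
y = np.arange(20) % 2

# Collect the sample indices that belong to each of the 10 folds.
kf = KFold(n_splits=10)
folds = [test_idx for _, test_idx in kf.split(X)]

# Apply LeavePOut to the fold numbers (0..9) instead of to the samples:
# every split keeps 8 folds for training and leaves 2 folds out.
fold_ids = np.arange(len(folds)).reshape(-1, 1)
lpo = LeavePOut(p=2)

splits = []
for train_folds, held_out in lpo.split(fold_ids):
    val_fold, test_fold = held_out
    train_idx = np.concatenate([folds[i] for i in train_folds])
    splits.append((train_idx, folds[val_fold], folds[test_fold]))

print(len(splits))  # number of ways to choose 2 held-out folds out of 10
```

Each entry of splits is a (train_idx, val_idx, test_idx) triple that can be used to index X and y. Running all combinations may be more than is needed; taking a single triple gives one 8/1/1 split.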

0
votes

You should change this line:

X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=1)

to

X_test, X_val, y_test, y_val = train_test_split(X_test, y_test, test_size=0.5, random_state=1)

to get exactly what you want. The first train_test_split divides the data 0.8/0.2 into train and test. The second divides that 0.2 into 0.1/0.1 for test and validation.

Then:

from sklearn.metrics import classification_report

model.fit(X_train, y_train)
print(classification_report(y_val, model.predict(X_val)))

Based on this report, you can decide whether to proceed to the test data or to change the hyperparameters to get higher scores on the validation set.
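
Put together, the whole workflow might look like the sketch below. It assumes the iris dataset as stand-in data (the question does not say what X and y are) and reuses the decision tree settings from the question; the intermediate names X_rest and y_rest are also illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumption: iris is used only as example data.
X, y = load_iris(return_X_y=True)

# First split: 80% train, 20% held out.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.2, random_state=1
)
# Second split: divide the held-out 20% into 10% test and 10% validation.
X_test, X_val, y_test, y_val = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=1
)

model = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=0)
model.fit(X_train, y_train)

# Tune hyperparameters against the validation set only.
print(classification_report(y_val, model.predict(X_val)))
val_acc = accuracy_score(y_val, model.predict(X_val))

# Touch the test set once, after the hyperparameters are fixed.
test_acc = accuracy_score(y_test, model.predict(X_test))
print(f"validation accuracy: {val_acc:.3f}, test accuracy: {test_acc:.3f}")
```

Keeping the test set out of the tuning loop is what makes the final test score a fair estimate of generalization error.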