
I am trying to use scikit-learn to build a classifier and then estimate its accuracy. My dataset is relatively small and I am unsure of the best hyper-parameters, so I turned to nested cross-validation (nCV) to build and evaluate my model.

I have been trying to understand the best methodology. However, after reading:

  1. https://stats.stackexchange.com/questions/229509/do-i-need-an-initial-train-test-split-for-nested-cross-validation
  2. https://stats.stackexchange.com/questions/410118/cross-validation-vs-train-validation-test/410206
  3. https://stats.stackexchange.com/questions/95797/how-to-split-the-dataset-for-cross-validation-learning-curve-and-final-evaluat

I am still at a loss as to the best way to proceed.

So far I have:

  1. Split (80%/20%) the entire data set into training and testing sets
  2. Defined my inner-cv, outer-cv, parameter grid and estimator (random forest)
  3. Run the nCV to get the mean accuracy score.

To do this, my code so far is:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import (KFold, RandomizedSearchCV,
                                     cross_val_score, train_test_split)

# Hold out 20% of the data as a final test set
X_train, X_test, Y_train, Y_test = train_test_split(X_res, Y_res, test_size=0.2)

inner_cv = KFold(n_splits=2, shuffle=True)
outer_cv = KFold(n_splits=2, shuffle=True)

rfc = RandomForestClassifier()
param_grid = {'bootstrap': [True, False],
              'max_depth': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, None],
              'max_features': ['auto', 'sqrt', 'log2', None],
              'min_samples_leaf': [1, 2, 4, 25],
              'min_samples_split': [2, 5, 10, 25],
              'criterion': ['gini', 'entropy'],
              'n_estimators': [200, 400, 600, 800, 1000, 1200, 1400, 1600, 1800, 2000]}

# Inner loop: randomized hyper-parameter search
rfclf = RandomizedSearchCV(rfc, param_grid, cv=inner_cv, n_iter=100, n_jobs=-1,
                           scoring='accuracy', verbose=1)

# Outer loop: cross-validated accuracy of the tuned pipeline
nested_cv_results = cross_val_score(rfclf, X_train, Y_train, cv=outer_cv, scoring='accuracy')

I now have 2 questions:

  1. How do I find the overall best model?
  2. How do I test the best model against X_test and Y_test?

1 Answer


Cross-validation is used either to assess model performance or to tune your hyper-parameters. If you use CV to tune your hyper-parameters, you cannot reuse those same CV scores to assess model performance: because of data leakage, the estimate will be overoptimistic. This is where nested CV helps. By adding an extra CV layer you prevent that leakage, so nested CV gives you an unbiased estimate of the model's performance.
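To see the difference, here is a minimal, purely illustrative sketch (toy data, a tiny grid and GridSearchCV rather than your actual pipeline): the inner search's best_score_ is computed on the same folds used to pick the hyper-parameters, while the nested score comes from folds the tuning procedure never saw.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

X, y = make_classification(n_samples=200, random_state=0)

inner_cv = KFold(n_splits=2, shuffle=True, random_state=0)
outer_cv = KFold(n_splits=2, shuffle=True, random_state=0)

search = GridSearchCV(RandomForestClassifier(random_state=0),
                      {'max_depth': [5, None]}, cv=inner_cv, scoring='accuracy')

# Non-nested: the folds that select the hyper-parameters also produce the score,
# so best_score_ tends to be optimistic.
search.fit(X, y)
print('non-nested CV accuracy:', search.best_score_)

# Nested: the outer CV scores the whole tuning procedure on unseen folds,
# giving a less biased performance estimate.
nested_scores = cross_val_score(search, X, y, cv=outer_cv, scoring='accuracy')
print('nested CV accuracy:', nested_scores.mean())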

To answer your questions: once you have run the nested CV on X_train/Y_train, you have your unbiased estimate of the model performance. Next, tune the hyper-parameters again with RandomizedSearchCV on the full X_train/Y_train, take the best model from that search, and evaluate it on X_test/Y_test.

Example code:

# Same setup as in the question (note refit=True, so the search refits the best
# estimator on the full training portion it was given)
X_train, X_test, Y_train, Y_test = train_test_split(X_res, Y_res, test_size=0.2)
inner_cv = KFold(n_splits=2, shuffle=True)
outer_cv = KFold(n_splits=2, shuffle=True)
rfc = RandomForestClassifier()
param_grid = {'bootstrap': [True, False],
              'max_depth': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, None],
              'max_features': ['auto', 'sqrt', 'log2', None],
              'min_samples_leaf': [1, 2, 4, 25],
              'min_samples_split': [2, 5, 10, 25],
              'criterion': ['gini', 'entropy'],
              'n_estimators': [200, 400, 600, 800, 1000, 1200, 1400, 1600, 1800, 2000]}
rfclf = RandomizedSearchCV(rfc, param_grid, cv=inner_cv, n_iter=100, n_jobs=-1,
                           scoring='accuracy', verbose=1, refit=True)

# Step 1: nested CV on the training set -> unbiased performance estimate
nested_cv_results = cross_val_score(rfclf, X_train, Y_train, cv=outer_cv, scoring='accuracy')

# Step 2: tune once more on the full training set, then evaluate the best model
# on the held-out test set
random_search = RandomizedSearchCV(rfc, param_grid, cv=inner_cv, n_iter=100, n_jobs=-1,
                                   scoring='accuracy', verbose=1, refit=True)
random_search.fit(X_train, Y_train)
random_search.best_estimator_.score(X_test, Y_test)
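
If you want the two numbers side by side (this just reuses the variables defined above), something like:

print('Nested CV accuracy: %.3f +/- %.3f' % (nested_cv_results.mean(), nested_cv_results.std()))
print('Held-out test accuracy: %.3f' % random_search.best_estimator_.score(X_test, Y_test))

The nested CV number is what you report as the expected performance of the whole tuning procedure; the test-set number is a final check on the single model you will actually use.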