
I am working on a binary classification problem and would like to perform nested cross-validation to assess the classification error. The reason I'm doing nested CV is the small sample size (N_0 = 20, N_1 = 10), where N_0 and N_1 are the numbers of instances in classes 0 and 1, respectively.

My code is quite simple:

>> from numpy import logspace
>> from sklearn.pipeline import Pipeline
>> from sklearn.preprocessing import StandardScaler; from sklearn.linear_model import LogisticRegression
>> from sklearn.grid_search import GridSearchCV; from sklearn.cross_validation import cross_val_score  # pre-0.18 modules
>> pipe_logistic = Pipeline([('scl', StandardScaler()), ('clf', LogisticRegression(penalty='l1'))])
>> parameters = {'clf__C': logspace(-4, 1, 50)}
>> grid_search = GridSearchCV(estimator=pipe_logistic, param_grid=parameters, verbose=1, scoring='f1', cv=5)
>> cross_val_score(grid_search, X, y, cv=5)

So far, so good. But if I want to change the CV scheme from random splitting to StratifiedShuffleSplit in both the outer and inner CV loops, I run into a problem: how can I pass the class vector y, which is required by the StratifiedShuffleSplit constructor?

Naively:

>> grid_search = GridSearchCV(estimator=pipe_logistic, param_grid=parameters, verbose=1, scoring='f1', cv=StratifiedShuffleSplit(y_inner_loop, 5, test_size=0.5, random_state=0))  # but what is y_inner_loop?
>> cross_val_score(grid_search, X, y, cv=StratifiedShuffleSplit(y, 5, test_size=0.5, random_state=0))

So, the problem is: how do I specify y_inner_loop?

** My data set is slightly imbalanced (20/10), and I would like to preserve this class ratio in both the training and assessment splits.

Somewhat off topic, but @arnold-klein, I'm struggling to understand how this implements nested CV -- could you give me any pointers to understand this code? - ScottEdwards2000
@ScottEdwards2000, please see my answer for a complete code snippet. Nested CV has two stages: an outer loop and an inner one. The outer loop (in my case 'sss_outer') splits the entire data set into 5 chunks (BTW, you don't necessarily have to use 'StratifiedShuffleSplit'); the inner loop then splits EACH chunk into 3 chunks, which it iterates over as training and test sets (as usual CV does). The nested CV in my code is implemented in a single line: '>> cross_val_score(grid_search, X, y, cv=sss_outer)'; it evaluates our model 5 times (since sss_outer makes 5 splits). - Arnold Klein
The hyperparameter C (inverse regularization strength) is tuned in the inner cross-validation, which is defined through 'sss_inner' and passed as a parameter to GridSearchCV: '>> grid_search = GridSearchCV(estimator=pipe_logistic, param_grid=parameters, verbose=1, scoring='f1', cv=sss_inner)'. So, effectively, 'cross_val_score' evaluates 5 surrogate models, each tuned on the inner CV loop by 'grid_search' (see the sketch right after these comments). - Arnold Klein
In addition, I would highly recommend reading Sebastian Raschka's book, Python Machine Learning. - Arnold Klein
@ArnoldKlein In your first comment -- "splits the entire data set into 5 chunks ... then the inner loop splits EACH chunk into 3 chunks" -- I think you meant not EACH chunk, but dividing the remaining 4 chunks of the outer loop into 3 chunks in the inner loop? At least, that's how I think nested CV works. - nafizh
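
To make the mechanics discussed in these comments concrete, here is a minimal sketch of what the one-liner 'cross_val_score(grid_search, X, y, cv=sss_outer)' does under the hood, assuming the scikit-learn 0.18 model_selection API, that X and y are NumPy arrays, and pipe_logistic/parameters as defined in the answer below. It makes explicit that the inner search only ever sees the outer training portion (the "remaining" part, as nafizh points out):

>> # Hand-rolled equivalent of cross_val_score(grid_search, X, y, cv=sss_outer)
>> from sklearn.model_selection import StratifiedShuffleSplit, GridSearchCV
>> outer = StratifiedShuffleSplit(n_splits=5, test_size=0.4, random_state=15)
>> inner = StratifiedShuffleSplit(n_splits=3, test_size=0.2, random_state=16)
>> outer_scores = []
>> for train_idx, test_idx in outer.split(X, y):   # outer loop: 5 stratified splits
>>     gs = GridSearchCV(pipe_logistic, parameters, scoring='f1', cv=inner)
>>     gs.fit(X[train_idx], y[train_idx])          # inner loop tunes C on the outer TRAINING part only
>>     outer_scores.append(gs.score(X[test_idx], y[test_idx]))  # tuned model assessed on the held-out part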

1 Answer


I have since resolved this problem, and the solution might be of interest to other ML novices. In the newest version of scikit-learn (0.18), the cross-validation utilities have moved to the sklearn.model_selection module and their API has changed slightly. To make a long story short:

>> from numpy import logspace
>> from sklearn.pipeline import Pipeline
>> from sklearn.preprocessing import StandardScaler
>> from sklearn.linear_model import LogisticRegression
>> from sklearn.model_selection import StratifiedShuffleSplit, GridSearchCV, cross_val_score
>> sss_outer = StratifiedShuffleSplit(n_splits=5, test_size=0.4, random_state=15)
>> sss_inner = StratifiedShuffleSplit(n_splits=3, test_size=0.2, random_state=16)
>> pipe_logistic = Pipeline([('scl', StandardScaler()), ('clf', LogisticRegression(penalty='l1'))])
>> parameters = {'clf__C': logspace(-4, 1, 50)}
>> grid_search = GridSearchCV(estimator=pipe_logistic, param_grid=parameters, verbose=1, scoring='f1', cv=sss_inner)
>> cross_val_score(grid_search, X, y, cv=sss_outer)

UPD: in the newest version, we no longer need to pass the target vector explicitly ("y", which was my problem initially) when constructing the splitter, only the desired number of splits; y is consumed later, by the splitter's split(X, y) method, which cross_val_score and GridSearchCV call internally.
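
To see where y enters under the new API, here is a minimal sketch, assuming X is a NumPy feature array and y is a 0/1 NumPy label vector shaped as in the question (30 samples, 20/10 class split):

>> # The constructor takes no labels; stratification happens in split(X, y)
>> sss = StratifiedShuffleSplit(n_splits=5, test_size=0.4, random_state=15)
>> for train_idx, test_idx in sss.split(X, y):
>>     # each split preserves the ~2:1 class ratio in both parts
>>     print(y[train_idx].sum(), y[test_idx].sum())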