
I am using sklearn to carry out recursive feature elimination with cross-validation, using the RFECV class. RFE involves repeatedly training an estimator on the current set of features, then removing the least informative ones, until it converges on the optimal number of features.

To get the best performance from the estimator, I want to select the best hyperparameters for it at each number of features (edited for clarity). The estimator is a linear SVM, so I am only tuning the C parameter.

Initially, my code was as follows. However, this just did one grid search for C at the beginning, and then used the same C for each iteration.

from sklearn.cross_validation import StratifiedKFold
from sklearn.feature_selection import RFECV
from sklearn import svm, grid_search

def get_best_feats(data,labels,c_values):

    parameters = {'C':c_values}

    # svm1 passed to clf which is used to grid search the best parameters
    svm1 = svm.SVC(kernel='linear')
    clf = grid_search.GridSearchCV(svm1, parameters, refit=True)
    clf.fit(data,labels)
    #print 'best gamma',clf.best_params_['gamma']

    # svm2 uses the optimal hyperparameters from svm1
    svm2 = svm.SVC(C=clf.best_params_['C'], kernel='linear')
    # svm2 is then passed to RFECV as the estimator for recursive feature elimination
    rfecv = RFECV(estimator=svm2, step=1, cv=StratifiedKFold(labels, 5))
    rfecv.fit(data,labels)

    print "support:",rfecv.support_
    return data[:,rfecv.support_]

The documentation for RFECV gives the parameter "estimator_params : Parameters for the external estimator. Useful for doing grid searches when an RFE object is passed as an argument to, e.g., a sklearn.grid_search.GridSearchCV object."

Therefore I want to try to pass my object 'rfecv' to the grid search object, as follows:

def get_best_feats2(data,labels,c_values):

    parameters = {'C':c_values}
    svm1 = svm.SVC(kernel='linear')
    rfecv = RFECV(estimator=svm1, step=1, cv=StratifiedKFold(labels, 5), estimator_params=parameters)
    rfecv.fit(data, labels)

    print "Kept {} out of {} features".format((data[:,rfecv.support_]).shape[1], data.shape[1])


    print "support:",rfecv.support_
    return data[:,rfecv.support_]

X,y = get_heart_data()


c_values = [0.1,1.,10.]
get_best_feats2(X,y,c_values)

But this returns the error:

max_iter=self.max_iter, random_seed=random_seed)
File "libsvm.pyx", line 59, in sklearn.svm.libsvm.fit (sklearn/svm/libsvm.c:1674)
TypeError: a float is required

So my question is: how can I pass the rfe object to the grid search in order to do cross-validation for each iteration of recursive feature elimination?

Thanks


1 Answer


So you want to grid-search the C in the SVM for each number of features in the RFE? Or for each CV iteration in the RFECV? From your last sentence, I guess it is the former.

You can do RFE(GridSearchCV(SVC(), param_grid)) to achieve that, though I'm not sure that is actually a helpful thing to do.

I don't think the second is possible right now (though it should be soon). You could do GridSearchCV(RFECV(), param_grid={'estimator__C': Cs_to_try}), but that nests two sets of cross-validation inside each other.
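For reference, here is a minimal sketch of the nested variant, wrapping an RFECV in a grid search. It assumes a recent scikit-learn, where GridSearchCV and StratifiedKFold live in sklearn.model_selection rather than the older modules used in the question. The synthetic data from make_classification is only a stand-in for the real dataset, and the fold counts are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.svm import SVC

# Synthetic stand-in for the asker's data (an assumption, not from the post).
X, y = make_classification(n_samples=120, n_features=8, n_informative=3,
                           n_redundant=0, random_state=0)

# Inner loop: RFECV chooses the number of features by cross-validation.
rfecv = RFECV(estimator=SVC(kernel="linear"), step=1,
              cv=StratifiedKFold(n_splits=3))

# Outer loop: the grid search reaches the SVM's C through the
# "estimator__" prefix, so each candidate C gets its own full RFECV run.
search = GridSearchCV(rfecv, param_grid={"estimator__C": [0.1, 1.0, 10.0]}, cv=3)
search.fit(X, y)

best_C = search.best_params_["estimator__C"]
support = search.best_estimator_.support_  # boolean mask of kept features
X_selected = X[:, support]
print("best C:", best_C)
print("kept {} of {} features".format(X_selected.shape[1], X.shape[1]))
```

Note that this picks one C for the whole elimination run, not one per feature count, and the two nested cross-validation loops multiply the number of fits.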

Update: GridSearchCV has no coef_, so the first one fails. A simple fix:

class GridSearchWithCoef(GridSearchCV):
    @property
    def coef_(self):
        return self.best_estimator_.coef_

Then use that class in place of GridSearchCV.
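Put together, the first approach might look like the sketch below: a grid search over C is re-run every time RFE refits on a smaller feature set. It assumes a recent scikit-learn (sklearn.model_selection imports). The synthetic data and the choice of n_features_to_select=3 are placeholders, not values from the question.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC


class GridSearchWithCoef(GridSearchCV):
    """GridSearchCV that exposes the best estimator's coef_ so RFE can rank features."""

    @property
    def coef_(self):
        return self.best_estimator_.coef_


# Synthetic stand-in for the asker's data (an assumption, not from the post).
X, y = make_classification(n_samples=120, n_features=8, n_informative=3,
                           n_redundant=0, random_state=0)

# Each RFE step clones and refits this search, so C is re-tuned
# for every candidate number of features.
inner = GridSearchWithCoef(SVC(kernel="linear"),
                           param_grid={"C": [0.1, 1.0, 10.0]}, cv=3)
rfe = RFE(estimator=inner, n_features_to_select=3, step=1)
rfe.fit(X, y)

n_kept = int(rfe.support_.sum())
final_C = rfe.estimator_.best_params_["C"]  # C chosen on the final feature subset
print("support:", rfe.support_)
print("C on final subset:", final_C)
```

This uses RFE, not RFECV, so the number of features to keep is fixed up front. The cross-validation here only tunes C at each step.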