
Inspired by this: http://scikit-learn.org/stable/auto_examples/feature_selection/plot_rfe_with_cross_validation.html#sphx-glr-auto-examples-feature-selection-plot-rfe-with-cross-validation-py

I am wondering if there is any way to get the features that produce a particular score:

[Figure: cross-validation score plotted against the number of features selected]

In this case, I would like to know which 10 selected features give that peak at #Features = 10.

Any ideas?

EDIT:

This is the code used to get that plot:

from sklearn import svm
from sklearn.feature_selection import RFECV
from sklearn.model_selection import KFold, StratifiedKFold  # for k-fold cross-validation
from sklearn.ensemble import RandomForestClassifier  # Random Forest
import matplotlib.pyplot as plt

# The "accuracy" scoring is proportional to the number of correct classifications
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)  # k=10, split the data into 10 equal parts
model_Linear_SVM = svm.SVC(kernel='linear', probability=True)
rfecv = RFECV(estimator=model_Linear_SVM, step=1, cv=kfold, scoring='accuracy')  # 10-fold cross-validation
rfecv = rfecv.fit(X, y)

print('Optimal number of features :', rfecv.n_features_)
print('Best features :', X.columns[rfecv.support_])
print('Original features :', X.columns)

plt.figure()
plt.xlabel("Number of features selected")
plt.ylabel("Cross validation score \n of number of selected features")
# note: grid_scores_ was removed in later scikit-learn releases;
# use rfecv.cv_results_['mean_test_score'] there instead
plt.plot(range(1, len(rfecv.grid_scores_) + 1), rfecv.grid_scores_)
plt.show()
Comments:

Could you please add the code that produced this figure? – Vladislav Gladkikh

@VladislavGladkikh added – Javiss

You first need to make the process reproducible by setting the param random_state everywhere you can find it. After that, you can extend the RFECV fit() method to print the features at each step. – Vivek Kumar

1 Answer


First, you can see which features are selected at the point where the cross-validation score is largest (in your case that is 17 or 21 features; it is hard to tell from the figure) with

rfecv.support_

or

rfecv.ranking_ 
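For example, assuming the fitted rfecv and the pandas DataFrame X from your code above, a minimal sketch that maps both attributes back to column names:

# Features kept at the optimum found by RFECV
print(X.columns[rfecv.support_])

# Equivalently, rank 1 marks the same optimal subset; features with
# higher ranks were eliminated, and (with step=1) the largest rank
# belongs to the first feature dropped
print(X.columns[rfecv.ranking_ == 1])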

Then you can calculate the importances of the selected features (at the peak of the CV-score curve) with

np.absolute(rfecv.estimator_.coef_)

for simple estimators or

rfecv.estimator_.feature_importances_ 

if your estimator is an ensemble, such as a random forest.
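Putting those together, a sketch that pairs each surviving feature with its importance, sorted from most to least important (this assumes the fitted rfecv and DataFrame X from the question, and a binary problem so that the linear SVM's coef_ has shape (1, n_features)):

import numpy as np

# |coef| of the refitted linear SVM, one value per surviving feature
importances = np.absolute(rfecv.estimator_.coef_).ravel()
for name, imp in sorted(zip(X.columns[rfecv.support_], importances),
                        key=lambda pair: -pair[1]):
    print(name, imp)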

Then you can remove the least important feature one at a time in a loop and recalculate RFECV on the remaining feature sets.
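If what you want is the subset behind one specific point on the curve (say, the 10 features at #Features = 10), note that the score at each size is an average over the CV folds, so there is no single subset behind a point; but fitting a plain RFE on the full data with the same estimator and step gives you the subset the recursion keeps at that size. A sketch, assuming the X, y, and linear SVC from the question:

from sklearn import svm
from sklearn.feature_selection import RFE

k = 10  # the point on the curve you are interested in
rfe = RFE(estimator=svm.SVC(kernel='linear'),
          n_features_to_select=k, step=1)
rfe.fit(X, y)
print(X.columns[rfe.support_])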