7 votes

I'm using scikit-learn's RFECV class to perform feature selection. I'm interested in identifying the relative importance of a bunch of variables. However, scikit-learn returns the same ranking (1) for multiple variables. This can also be seen in their example code:

>>> from sklearn.datasets import make_friedman1
>>> from sklearn.feature_selection import RFECV
>>> from sklearn.svm import SVR
>>> X, y = make_friedman1(n_samples=50, n_features=10, random_state=0)
>>> estimator = SVR(kernel="linear")
>>> selector = RFECV(estimator, step=1, cv=5)
>>> selector = selector.fit(X, y)
>>> selector.support_ 
array([ True,  True,  True,  True,  True, False, False, False, False,
       False])
>>> selector.ranking_
array([1, 1, 1, 1, 1, 6, 4, 3, 2, 5])

Is there a way I can make scikit-learn also identify the relative importance between the top features?

I'm happy to increase the number of trees or similar if that's needed. Related to this, is there a way to see the confidence of this ranking?

I think this question would be more appropriate for stats.stackexchange.com. – Thiago Barcala
Fair. I'd be happy to have it moved. If any moderators see this, please feel free to move it :) – pir
I think most scikit-learn questions are on SO, so I'd keep it here. – Andreas Mueller
You are using a linear SVR; you can try changing the kernel to 'poly'. – cho_uc

1 Answer

6 votes

The goal of RFECV is to select the optimum number of features, so it cross-validates over the number of features selected. In your case, it chose to keep 5 features. The model is then refit on the whole dataset, eliminating features until only those 5 remain. Since those 5 are never removed, RFE never ranks them against each other; they all share rank 1.
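As a quick sanity check, RFECV exposes the number of features it decided to keep as n_features_ (continuing from the snippet in your question):

>>> selector.n_features_
5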

You can get a complete ranking of all features by running plain RFE and telling it to eliminate down to a single feature:

from sklearn.datasets import make_friedman1
from sklearn.feature_selection import RFE
from sklearn.svm import SVR

X, y = make_friedman1(n_samples=50, n_features=10, random_state=0)
estimator = SVR(kernel="linear")
# n_features_to_select=1 forces RFE to keep eliminating until a single
# feature is left, so every feature gets a distinct rank
selector = RFE(estimator, step=1, n_features_to_select=1)
selector = selector.fit(X, y)
selector.ranking_

array([ 4,  3,  5,  1,  2, 10,  8,  7,  6,  9])

You might ask why the rankings computed during cross-validation are not kept, since each of those did rank all features. The reason is that for each cross-validation split, the features might have been ranked differently. So RFECV could alternatively return 5 different rankings (one per fold) for you to compare, as sketched below. That's not the interface, though, but it is easy to accomplish by running RFE inside the cross-validation loop yourself.
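A minimal sketch of that idea, assuming a 5-fold KFold split (the fold count and the fold_rankings variable are my choices for illustration, not part of the RFECV API):

import numpy as np
from sklearn.datasets import make_friedman1
from sklearn.feature_selection import RFE
from sklearn.model_selection import KFold
from sklearn.svm import SVR

X, y = make_friedman1(n_samples=50, n_features=10, random_state=0)

fold_rankings = []
for train_idx, _ in KFold(n_splits=5).split(X):
    # rank all features on this fold's training data only
    selector = RFE(SVR(kernel="linear"), step=1, n_features_to_select=1)
    selector.fit(X[train_idx], y[train_idx])
    fold_rankings.append(selector.ranking_)

# one row per fold; the spread across rows hints at how stable the ranking is
print(np.array(fold_rankings))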

On a different note, recursive feature elimination might not be the best way to measure the influence of the features; looking at the coefficients directly, or permutation importance, might be more informative.
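If you want to try the permutation-importance route, here is a minimal sketch (permutation_importance lives in sklearn.inspection as of scikit-learn 0.22; n_repeats=30 is my choice). The importances_std it reports also gives you a rough notion of the confidence you asked about, though ideally you would compute it on held-out data rather than the training set:

from sklearn.datasets import make_friedman1
from sklearn.inspection import permutation_importance
from sklearn.svm import SVR

X, y = make_friedman1(n_samples=50, n_features=10, random_state=0)
estimator = SVR(kernel="linear").fit(X, y)

# shuffle each feature n_repeats times and measure the drop in score
result = permutation_importance(estimator, X, y, n_repeats=30, random_state=0)

# mean importance per feature, plus a std that hints at the ranking's stability
print(result.importances_mean)
print(result.importances_std)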