I am trying to understand how to read the `grid_scores_` and `ranking_` values in `RFECV`. Here is the main example from the documentation:
```python
>>> from sklearn.datasets import make_friedman1
>>> from sklearn.feature_selection import RFECV
>>> from sklearn.svm import SVR
>>> X, y = make_friedman1(n_samples=50, n_features=10, random_state=0)
>>> estimator = SVR(kernel="linear")
>>> selector = RFECV(estimator, step=1, cv=5)
>>> selector = selector.fit(X, y)
>>> selector.support_
array([ True,  True,  True,  True,  True,
       False, False, False, False, False], dtype=bool)
>>> selector.ranking_
array([1, 1, 1, 1, 1, 6, 4, 3, 2, 5])
```
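For reference, this is how I have been inspecting the two attributes on the fitted selector (a minimal sketch; it assumes a scikit-learn version that still exposes `grid_scores_`, since newer releases move these scores into `cv_results_`):

```python
# Continuing from the fitted selector above.
# ranking_ has one entry per input feature, while grid_scores_ holds one
# cross-validation score per candidate number of features tried by RFECV.
print(selector.ranking_.shape, selector.grid_scores_.shape)
print(selector.grid_scores_)
```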
How am I supposed to read `ranking_` and `grid_scores_`? Is a lower ranking value better (or vice versa)? The reason I ask is that I have noticed that the features with the highest ranking values typically have the highest scores in `grid_scores_`.
However, if a feature has a ranking of 1, shouldn't that mean it was ranked as the best of the group? This is also what the documentation says:

"Selected (i.e., estimated best) features are assigned rank 1"
But now let's look at the following example using some real data:

```python
>>> rfecv.grid_scores_[np.nonzero(rfecv.ranking_ == 1)[0]]
0.0
```

while the feature with the highest ranking value has the highest *score*:

```python
>>> rfecv.grid_scores_[np.argmax(rfecv.ranking_)]
0.997
```

Note that in the example above, the features with ranking = 1 have the lowest score.
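To make the comparison I am doing explicit, this is roughly the inspection I run (a sketch; `rfecv` is the RFECV instance fitted on my real data, which is not shown here):

```python
import numpy as np

# rfecv is my fitted RFECV (data not shown).
print("ranking_:     ", rfecv.ranking_)
print("grid_scores_: ", rfecv.grid_scores_)
print("scores at rank-1 positions:", rfecv.grid_scores_[np.nonzero(rfecv.ranking_ == 1)[0]])
print("score at argmax(ranking_): ", rfecv.grid_scores_[np.argmax(rfecv.ranking_)])
```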
Figure in the documentation:

On this matter, in this figure in the documentation the y-axis reads "number of misclassifications", but it is plotting `grid_scores_`, which used `'accuracy'` (?) as the scoring function. Shouldn't the y label read "accuracy" (the higher the better) instead of "number of misclassifications" (the lower the better)?
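For reference, this is roughly how I would expect that figure to be generated from `grid_scores_` (a sketch following the usual pattern of the docs example, not the exact script; `rfecv` here is a fitted RFECV with `scoring='accuracy'`):

```python
import matplotlib.pyplot as plt

# Each entry of grid_scores_ is the mean CV score (here: accuracy) obtained
# with that many selected features, so higher values should mean better.
plt.plot(range(1, len(rfecv.grid_scores_) + 1), rfecv.grid_scores_)
plt.xlabel("Number of features selected")
plt.ylabel("Cross-validation score (accuracy)")
plt.show()
```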