
I want to apply a wrapper method like Recursive Feature Elimination to my regression problem with scikit-learn. Recursive feature elimination with cross-validation gives a good overview of how to tune the number of features automatically.

I tried this:

import matplotlib.pyplot as plt
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression

modelX = LogisticRegression()
rfecv = RFECV(estimator=modelX, step=1, scoring='mean_absolute_error')
rfecv.fit(df_normdf, y_train)
print("Optimal number of features : %d" % rfecv.n_features_)

# Plot number of features VS. cross-validation scores
plt.figure()
plt.xlabel("Number of features selected")
plt.ylabel("Cross validation score (nb of correct classifications)")
plt.plot(range(1, len(rfecv.grid_scores_) + 1), rfecv.grid_scores_)
plt.show()

but I receive an error message like

The least populated class in y has only 1 members, which is too few.
The minimum number of labels for any class cannot be less than n_folds=3. % (min_labels, self.n_folds)), Warning)

The warning sounds like I have a classification problem, but my task is a regression problem. What can I do to get a result, and what's wrong?

Can you show us your y_train? – MMF
My y_train has 1 column and ~10,000 rows with values between 1 and 200. – matthew
Are the values integers? If so, I think it is treated as a multiclass classification problem. Try to cast the values to floats. – MMF
That makes sense. I've cast the values to floats, but the same warning occurs. – matthew
I got it now, I'll try to write an answer. – MMF

1 Answer


Here is what happened:

By default, when the number of folds is not specified by the user, the cross-validation inside RFECV uses 3-fold cross-validation. So far so good.

However, if you look at the documentation, it also uses StratifiedKFold, which creates the folds by preserving the percentage of samples for each class. Therefore, since it seems (according to the error) that some values of your output y are unique, they cannot be in 3 different folds at the same time. It throws an error!

The error comes from here.

You then need to use an unstratified K-fold: KFold.
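A minimal sketch of that fix, using synthetic data from make_regression and LinearRegression in place of your estimator (LogisticRegression is a classifier, which is part of what triggers the stratification). Note that in current scikit-learn the MAE scorer string is 'neg_mean_absolute_error':

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

# Synthetic regression data standing in for df_normdf / y_train
X, y = make_regression(n_samples=100, n_features=10,
                       n_informative=5, random_state=0)

rfecv = RFECV(
    estimator=LinearRegression(),       # a regressor, not LogisticRegression
    step=1,
    cv=KFold(n_splits=3),               # explicit plain KFold: no stratification
    scoring="neg_mean_absolute_error",  # regression scorer (maximized, hence negative)
)
rfecv.fit(X, y)
print("Optimal number of features : %d" % rfecv.n_features_)
```

Passing a KFold instance as cv= overrides the automatic StratifiedKFold choice, so unique target values no longer cause an error.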

The documentation of RFECV says that: "If the estimator is a classifier or if y is neither binary nor multiclass, sklearn.model_selection.KFold is used."