I am using RFECV for feature selection in scikit-learn. I would like to compare the result of a simple linear model (X, y) with that of a log-transformed model (X, log(y)).
Simple Model:
RFECV and cross_val_score provide the same result. To check this, compare the mean cross-validation score across all folds with the RFECV score obtained with all features: 0.66 = 0.66, so there is no problem and the results are reliable.
Log Model:
The problem: it seems that RFECV does not provide a way to transform y. The scores in this case are 0.55 (cross_val_score) vs 0.53 (RFECV with all features). This is quite expected though, because I had to manually apply np.log to fit the data: log_selector = log_selector.fit(X, np.log(y)). This r2 score is computed on log(y), with no inverse_func applied, while what we need is a way to fit the model on log(y_train) and calculate the score against y_test after applying exp to the predictions. Alternatively, if I try to use TransformedTargetRegressor, I get the error shown in the code: The classifier does not expose "coef_" or "feature_importances_" attributes.
How do I resolve the problem and make sure that the feature selection process is reliable?
from sklearn.datasets import make_friedman1
from sklearn.feature_selection import RFECV
from sklearn import linear_model
from sklearn.model_selection import cross_val_score
from sklearn.compose import TransformedTargetRegressor
import numpy as np
X, y = make_friedman1(n_samples=50, n_features=10, random_state=0)
estimator = linear_model.LinearRegression()
log_estimator = TransformedTargetRegressor(regressor=linear_model.LinearRegression(),
                                           func=np.log,
                                           inverse_func=np.exp)
selector = RFECV(estimator, step=1, cv=5, scoring='r2')
selector = selector.fit(X, y)
###
# log_selector = RFECV(log_estimator, step=1, cv=5, scoring='r2')
# log_selector = log_selector.fit(X, y)
# RuntimeError: The classifier does not expose "coef_" or "feature_importances_" attributes
###
log_selector = RFECV(estimator, step=1, cv=5, scoring='r2')
log_selector = log_selector.fit(X, np.log(y))
print("**Simple Model**")
print("RFECV, r2 scores: ", np.round(selector.grid_scores_,2))
scores = cross_val_score(estimator, X, y, cv=5)
print("cross_val, mean r2 score: ", round(np.mean(scores),2), ", same as RFECV score with all features")
print("no of feat: ", selector.n_features_ )
print("**Log Model**")
log_scores = cross_val_score(log_estimator, X, y, cv=5)
print("RFECV, r2 scores: ", np.round(log_selector.grid_scores_,2))
print("cross_val, mean r2 score: ", round(np.mean(log_scores),2))
print("no of feat: ", log_selector.n_features_ )
Output:
**Simple Model**
RFECV, r2 scores: [0.45 0.6 0.63 0.68 0.68 0.69 0.68 0.67 0.66 0.66]
cross_val, mean r2 score: 0.66 , same as RFECV score with all features
no of feat: 6
**Log Model**
RFECV, r2 scores: [0.39 0.5 0.59 0.56 0.55 0.54 0.53 0.53 0.53 0.53]
cross_val, mean r2 score: 0.55
no of feat: 3
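One possible direction, sketched here as an assumption rather than a confirmed fix: since scikit-learn 0.24, RFECV accepts an importance_getter argument, and TransformedTargetRegressor stores its fitted inner model as regressor_. Pointing importance_getter at "regressor_.coef_" should let RFECV rank features from the inner LinearRegression while r2 is scored on the original scale of y (the wrapper applies np.exp to its predictions). The hasattr check is only there because grid_scores_ was replaced by cv_results_ in newer releases.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.datasets import make_friedman1
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X, y = make_friedman1(n_samples=50, n_features=10, random_state=0)

# Fit on log(y); predictions are mapped back with exp before scoring.
log_estimator = TransformedTargetRegressor(
    regressor=LinearRegression(), func=np.log, inverse_func=np.exp
)

# Assumption: scikit-learn >= 0.24, where RFECV has importance_getter.
# "regressor_.coef_" reads the coefficients of the fitted inner model.
log_selector = RFECV(
    log_estimator,
    step=1,
    cv=5,
    scoring="r2",
    importance_getter="regressor_.coef_",
)
log_selector.fit(X, y)  # pass the untransformed y

# Mean CV r2 per number of features (1 feature first, all features last).
if hasattr(log_selector, "cv_results_"):
    rfecv_scores = log_selector.cv_results_["mean_test_score"]
else:  # older releases
    rfecv_scores = log_selector.grid_scores_

# Same reliability check as for the simple model.
cv_scores = cross_val_score(log_estimator, X, y, cv=5, scoring="r2")

print("RFECV, r2 scores: ", np.round(rfecv_scores, 2))
print("cross_val, mean r2 score: ", round(np.mean(cv_scores), 2))
print("no of feat: ", log_selector.n_features_)
```

With this setup, the last RFECV score (all features) and the mean of cross_val_score are computed from the same estimator on the same folds, so they should agree in the same way the 0.66 = 0.66 check does for the simple model.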