Well, first, let's point out that RFECV and RFE are doing two separate jobs in your script: the former selects the optimal number of features, while the latter selects the five most important features (or rather, the best combination of 5 features, given their importance for the DecisionTreeRegressor).
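As a rough sketch of that difference (your script is not reproduced here, so the breast-cancer dataset, the DecisionTreeClassifier, `cv=3` and the variable names are my own stand-ins): RFECV decides how many features to keep via cross-validation, whereas RFE keeps exactly the number you ask for.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE, RFECV
from sklearn.tree import DecisionTreeClassifier

# Illustrative data only; not the dataset from the question
X, y = load_breast_cancer(return_X_y=True)

# RFECV: the number of retained features is chosen by cross-validation
rfecv = RFECV(estimator=DecisionTreeClassifier(random_state=0), cv=3)
rfecv.fit(X, y)
print('RFECV kept', rfecv.n_features_, 'features')

# RFE: the number of retained features is fixed by n_features_to_select
rfe_five = RFE(estimator=DecisionTreeClassifier(random_state=0), n_features_to_select=5)
rfe_five.fit(X, y)
print('RFE kept', rfe_five.n_features_, 'features')
```

The number printed for RFECV depends on the data and the folds, so it is not fixed in advance; the one printed for RFE is always the value you requested.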
Back to your question: "When did the RFE pass the information about which features have been selected to the Decision Tree?" It is worth noting that RFE does not explicitly tell the Decision Tree which features were selected. It simply takes a matrix as input (the training set) and transforms it into a matrix of N columns, based on the n_features_to_select=N parameter.
That matrix (i.e., the transformed training set) is passed to the Decision Tree along with the target variable. Fitting returns a model that can be used to predict unseen instances, once those have been reduced to the same N columns.
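A minimal sketch of that hand-off, assuming the breast-cancer dataset and N=2 purely for illustration (the names `selector`, `X_demo` and `X_reduced` are mine): the output of fit_transform is just the input with the non-selected columns dropped, and the tree only ever sees that reduced matrix.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = load_breast_cancer(return_X_y=True)

selector = RFE(estimator=DecisionTreeClassifier(random_state=0), n_features_to_select=2)
X_reduced = selector.fit_transform(X_demo, y_demo)

print(X_demo.shape)     # (569, 30)
print(X_reduced.shape)  # (569, 2)

# The transformed matrix is the original one filtered by the boolean mask support_
print(np.array_equal(X_reduced, X_demo[:, selector.support_]))  # True

# The tree is fitted on the reduced matrix, so it only knows about 2 columns
tree = DecisionTreeClassifier(random_state=0).fit(X_reduced, y_demo)
print(tree.n_features_in_)  # 2
```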
Let's dive into an example for classification:
""" Import dependencies and load data """
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE
from sklearn.metrics import precision_score
from sklearn.tree import DecisionTreeClassifier
X, y = load_breast_cancer(return_X_y=True)
rfe = RFE(estimator=DecisionTreeClassifier(), n_features_to_select=2)
We have now loaded the breast_cancer dataset and instantiated an RFE object (I used a DecisionTreeClassifier, but other algorithms can be used as well).
To see how the training data is handled within a pipeline, let's start with a manual example that shows how a pipeline would work if decomposed into its "basic steps":
from sklearn.model_selection import train_test_split

def test_and_train(X, y, random_state):
    # For simplicity, let's use 80%-20% splitting
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=random_state)

    # Fit and transform the training data by applying Recursive Feature Elimination
    X_train_transformed = rfe.fit_transform(X_train, y_train)
    # Transform the testing data to select the same features
    X_test_transformed = rfe.transform(X_test)

    print(X_train[0:3])
    print(X_train_transformed[0:3])
    print(X_test_transformed[0:3])

    # Train on the transformed training data
    fitted_model = DecisionTreeClassifier().fit(X_train_transformed, y_train)
    # Predict on the transformed testing data
    y_pred = fitted_model.predict(X_test_transformed)

    print('True labels: ', y_test)
    print('Predicted labels:', y_pred)
    return y_test, y_pred

precisions = list() # to store the precision scores (can be replaced by any other evaluation measure)
y_test, y_pred = test_and_train(X, y, 42)
precisions.append(precision_score(y_test, y_pred))
y_test, y_pred = test_and_train(X, y, 84)
precisions.append(precision_score(y_test, y_pred))
y_test, y_pred = test_and_train(X, y, 168)
precisions.append(precision_score(y_test, y_pred))
print('Average precision:', np.mean(precisions))
"""
Average precision: 0.92
"""
In the above script, we created a function that, given a dataset X and a target variable y:
- Creates a training and a testing set following the 80%-20% splitting rule.
- Transforms them using RFE (i.e., selects the best 2 features, as specified in the first code snippet). Calling fit_transform on the RFE runs the Recursive Feature Elimination on the training set and saves information about the selected features in the object's state. To see which features were selected, inspect rfe.support_ (a boolean mask over the original columns).
  Note: on the testing set only transform is executed, so the features flagged in rfe.support_ are kept and all the others are filtered out of the testing set.
- Fits a model and returns a tuple (y_test, y_pred).
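If you want names rather than a boolean mask, something along these lines should work. It is only a sketch: I fit on the full dataset here for brevity instead of a training split, and `data`, `selector` and `selected_names` are names I made up for the example.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeClassifier

# Loading without return_X_y gives access to the feature names as well
data = load_breast_cancer()

selector = RFE(estimator=DecisionTreeClassifier(random_state=0), n_features_to_select=2)
selector.fit(data.data, data.target)

print(selector.support_)  # boolean mask, True for the selected columns
print(selector.ranking_)  # rank 1 marks the selected features

# feature_names is a NumPy array, so the mask can index it directly
selected_names = data.feature_names[selector.support_]
print(selected_names)
```

Which two names are printed depends on the data the selector was fitted on, so they can differ from one training split to the next.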
y_test and y_pred can be used to analyze the performance of the model, e.g., its precision.
The precision is saved in a list, and the procedure is repeated 3 times.
Finally, we print the average precision.
We simulated a cross-validation procedure: we split the original data 3 times into a training and a testing set, fitted a model on each split, and computed and averaged its performance (i.e., precision) across the three splits.
This process can be simplified using a RepeatedKFold validation:
from sklearn.model_selection import RepeatedKFold

precisions = list()
rkf = RepeatedKFold(n_splits=2, n_repeats=3, random_state=1)

for train_index, test_index in rkf.split(X, y):
    # print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

    X_train_transformed = rfe.fit_transform(X_train, y_train)
    X_test_transformed = rfe.transform(X_test)

    fitted_model = DecisionTreeClassifier().fit(X_train_transformed, y_train)
    y_pred = fitted_model.predict(X_test_transformed)
    precisions.append(precision_score(y_test, y_pred))

print('Average precision:', np.mean(precisions))
"""
Average precision: 0.93
"""
and even further with Pipeline:
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score
rkf = RepeatedKFold(n_splits=2, n_repeats=3, random_state=1)
pipeline = Pipeline(steps=[('s',rfe),('m',DecisionTreeClassifier())])
precisions = cross_val_score(pipeline, X, y, scoring='precision', cv=rkf)
print('Average precision:', np.mean(precisions))
"""
Average precision: 0.93
"""
In summary, when the original data is passed to cross_val_score together with the Pipeline, the following happens:
- the cross-validator (here, RepeatedKFold) splits the data into training and testing data;
- the Pipeline calls RFE.fit_transform() on the training data;
- it calls estimator.fit() on the transformed training data to fit (i.e., train) a model;
- it applies RFE.transform() to the testing data so that it consists of the same features;
- it calls estimator.predict() on the transformed testing data to predict it;
- cross_val_score compares the predictions with the actual values and stores the performance result (the measure you passed to the scoring parameter) for that split;
- the steps above are repeated for every split in the cross-validation procedure.
At the end of the procedure, cross_val_score returns the scores, one per split, and you can average them.
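If you also want to check which features RFE handed to the tree in each split, one option (a sketch, not the only way) is to use cross_validate with return_estimator=True instead of cross_val_score, and then look inside each fitted pipeline. The step names 's' and 'm' match the pipeline above; the fixed random_state values are my addition to make the run repeatable.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE
from sklearn.model_selection import RepeatedKFold, cross_validate
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

data = load_breast_cancer()
X, y = data.data, data.target

pipeline = Pipeline(steps=[
    ('s', RFE(estimator=DecisionTreeClassifier(random_state=0), n_features_to_select=2)),
    ('m', DecisionTreeClassifier(random_state=0)),
])
rkf = RepeatedKFold(n_splits=2, n_repeats=3, random_state=1)

# return_estimator=True keeps the pipeline fitted on each training split
results = cross_validate(pipeline, X, y, scoring='precision', cv=rkf, return_estimator=True)

for split, fitted in enumerate(results['estimator']):
    mask = fitted.named_steps['s'].support_
    # Features selected on this split, and how many columns the tree was trained on
    print(split, data.feature_names[mask], fitted.named_steps['m'].n_features_in_)

print('Average precision:', results['test_score'].mean())
```

Each fitted pipeline carries its own RFE, so the selected features may change from split to split; the tree in that pipeline is always trained on exactly the columns its RFE kept.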