scikit-learn: Custom feature selection within each CV fold

Question

I would like to do a random search over a large hyperparameter grid. One of the hyperparameters I’d like to optimize is feature selection. scikit-learn provides some very useful functionality for this, like the RFECV class, but this is not compatible with all models, since some don’t expose coef_ or feature_importances_ attributes. So I would like to compare RFECV with univariate feature selection. In particular, I would like to keep all features with associations to my dependent variable that are statistically significant at uncorrected p < 0.05 in univariate analyses. However, my modelling strategy for the data is fairly complex, such that it’s not an option to use one of the existing scikit-learn classes like SelectKBest or SelectFdr to apply a simple univariate statistical test. At the same time, I am wary of simply pre-calculating the significant univariate associations on the entire dataset because this seems to mix training and test data.

The easiest way to address this that I can see is to pre-calculate the significant univariate associations for the subset of the data in each cross-validation split, and then implement a custom feature selection function that reads these from a text file. I understand from this question that I can create a custom feature selection object that takes the cross-validation object in its constructor:

class ExternalSelector():
    """
    Univariate feature selection by reading pre-calculated results
    for each CV split. 
    """

    def __init__(self, cv):
        self.cv = cv
        self.feature_subset = None

    def transform(self, X, y=None, **kwargs):
        split_idx = 0
        for train_idxs, test_idxs in self.cv:
            # read the file

            # subset X

            split_idx = split_idx + 1

    def fit(self, X, y=None):
        return self

    def get_params(self):

... but reviewing sklearn's univariate feature selection source code, I can’t figure out how or even whether it’s possible to return a list of Xs for each split.

How can I implement a custom feature selection function that reads a different list of features for each cross-validation split?

Mohsin hasan · Accepted Answer · 2019-08-07T19:00:44

Check out GenericUnivariateSelect, it seems ideal for your case.

Here is an example of how you can use it in CV:

import numpy as np
from sklearn.feature_selection import GenericUnivariateSelect, f_classif
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline


X = np.array([[1, 1, 0],
             [1, 0, 0],
             [0, 1, 1],
             [0, 0, 1],
             [0, 0, 0]])

Y = np.array([1, 1, 0, 0, 1])

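# No shuffling, so the folds are taken in order (random_state would have no effect)
cv = KFold(n_splits=5).split(X)
# 'fwe' applies a family-wise error correction at alpha = 0.05;
# mode='fpr' would threshold on uncorrected p-values instead
feature_selector = GenericUnivariateSelect(f_classif, mode='fwe', param=0.05)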
model = LogisticRegression(solver='lbfgs')

pipe = Pipeline([
    ('feature', feature_selector),
    ('logreg', model)
])

for i, (train_idx, test_idx) in enumerate(cv):
    # The selector is refit on the training rows of each fold only
    pipe.fit(X[train_idx], Y[train_idx])
    score = pipe.score(X[test_idx], Y[test_idx])
    print("Feature selected for fold {} is {}".format(i, pipe.named_steps['feature'].get_support()))

Output:

# Feature selected for fold 0 is [False False  True]
# Feature selected for fold 1 is [False False  True]
# Feature selected for fold 2 is [False False  True]
# Feature selected for fold 3 is [False False  True]
# Feature selected for fold 4 is [ True False  True]
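Since the goal is a random search, the manual loop above can probably be handed over to RandomizedSearchCV, which clones and refits the whole pipeline (selector included) on the training part of every split. The following is only a sketch under assumptions: the synthetic data from make_classification, the candidate values in param_distributions, and the choice of n_iter are all made up for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import GenericUnivariateSelect, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline

# Synthetic data, assumed purely for illustration
X, y = make_classification(n_samples=200, n_features=10,
                           n_informative=3, random_state=0)

pipe = Pipeline([
    ('feature', GenericUnivariateSelect(f_classif, mode='fpr', param=0.05)),
    ('logreg', LogisticRegression(solver='lbfgs')),
])

# Hypothetical search space: the selection mode is treated as a
# hyperparameter next to the regularisation strength of the model
param_distributions = {
    'feature__mode': ['fpr', 'fdr', 'fwe'],
    'logreg__C': [0.01, 0.1, 1.0, 10.0],
}

search = RandomizedSearchCV(pipe, param_distributions,
                            n_iter=5, cv=5, random_state=0)
search.fit(X, y)  # feature selection is fit inside each training fold
print(search.best_params_)
```

Because the selector is a pipeline step, no test-fold rows are seen when the p-values are computed, which is the leakage concern raised in the question.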

You can replace f_classif with your own function, as long as it takes (X, y) and returns a pair of arrays (scores, pvalues) with one entry per feature.
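As a rough sketch of what such a replacement might look like: the function name spearman_score, the use of a Spearman correlation, and the random data are all assumptions for illustration, standing in for whatever univariate model you actually need. mode='fpr' is used here because it keeps features whose p-value is below the threshold without a multiple-comparison correction, which seems to match the uncorrected p < 0.05 criterion in the question.

```python
import numpy as np
from scipy import stats
from sklearn.feature_selection import GenericUnivariateSelect


def spearman_score(X, y):
    """Hypothetical score function: Spearman correlation of each feature with y.

    Returns (scores, pvalues), each of shape (n_features,).
    """
    X = np.asarray(X)
    scores = np.empty(X.shape[1])
    pvalues = np.empty(X.shape[1])
    for j in range(X.shape[1]):
        rho, p = stats.spearmanr(X[:, j], y)
        scores[j] = abs(rho)  # larger score means stronger association
        pvalues[j] = p
    return scores, pvalues


# Made-up data: only the first feature is constructed to depend on y
rng = np.random.RandomState(0)
y = rng.normal(size=200)
X = rng.normal(size=(200, 4))
X[:, 0] += 2 * y

selector = GenericUnivariateSelect(spearman_score, mode='fpr', param=0.05)
selector.fit(X, y)
mask = selector.get_support()  # boolean mask of the features that were kept
print(mask)
```

Used as the 'feature' step of the pipeline, this function is called on the training rows of each fold only, so nothing needs to be pre-computed or read from a file.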
