0
votes

I would like to do a random search over a large hyperparameter grid. One of the hyperparameters I’d like to optimize is feature selection. scikit-learn provides some very useful functionality for this, like the RFECV class, but this is not compatible with all models, since some don’t expose coef_ or feature_importances_ attributes. So I would like to compare RFECV with univariate feature selection. In particular, I would like to keep all features with associations to my dependent variable that are statistically significant at uncorrected p < 0.05 in univariate analyses. However, my modelling strategy for the data is fairly complex, such that it’s not an option to use one of the existing scikit-learn classes like SelectKBest or SelectFdr to apply a simple univariate statistical test. At the same time, I am wary of simply pre-calculating the significant univariate associations on the entire dataset because this seems to mix training and test data.

The easiest way to address this that I can see is to pre-calculate the significant univariate associations for the subset of the data in each cross-validation split, and then implement a custom feature selection function that reads these from a text file. I understand from this question that I can create a custom feature selection object that takes the cross-validation object in its constructor:

class ExternalSelector():
    """
    Univariate feature selection by reading pre-calculated results
    for each CV split. 
    """

    def __init__(self, cv):
        self.cv = cv
        self.feature_subset = None

    def transform(self, X, y=None, **kwargs):
        split_idx = 0
        for train_idxs, test_idxs in self.cv:
            # read the file

            # subset X

            split_idx = split_idx + 1

    def fit(self, X, y=None):
        return self

    def get_params(self, deep=True):
        return {"cv": self.cv}

... but reviewing sklearn's univariate feature selection source code, I can’t figure out how or even whether it’s possible to return a list of Xs for each split.

How can I implement a custom feature selection function that reads a different list of features for each cross-validation split?

2 Answers

1
votes

Check out GenericUnivariateSelect; it seems ideal for your case.

Here is an example of how you can use it in CV:

import numpy as np
from sklearn.feature_selection import GenericUnivariateSelect, f_classif
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline


X = np.array([[1, 1, 0],
             [1, 0, 0],
             [0, 1, 1],
             [0, 0, 1],
             [0, 0, 0]])

Y = np.array([1, 1, 0, 0, 1])

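cv = KFold(n_splits=5).split(X)  # random_state only applies when shuffle=True
# mode='fwe' controls the family-wise error rate at alpha = 0.05;
# use mode='fpr' for an uncorrected p < 0.05 threshold
feature_selector = GenericUnivariateSelect(f_classif, mode='fwe', param=0.05)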
model = LogisticRegression(solver='lbfgs')

pipe = Pipeline([
    ('feature', feature_selector),
    ('logreg', model)
])

for i, (train_idx, test_idx) in enumerate(cv):
    # the selector is refit on the training rows of each fold only
    pipe.fit(X[train_idx], Y[train_idx])
    score = pipe.score(X[test_idx], Y[test_idx])
    print("Feature selected for fold {} is {}".format(i, pipe.named_steps['feature'].get_support()))

Output:

# Feature selected for fold 0 is [False False  True]
# Feature selected for fold 1 is [False False  True]
# Feature selected for fold 2 is [False False  True]
# Feature selected for fold 3 is [False False  True]
# Feature selected for fold 4 is [ True False  True]
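
Since the question is about random search, here is a rough sketch of how a pipeline like the one above could be handed to RandomizedSearchCV. The search refits the whole pipeline, selector included, on the training rows of each split, so the selection never sees the held-out rows. The synthetic data, the candidate values and the choice of which parameters to search over are assumptions made for illustration, not something taken from the question.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import GenericUnivariateSelect, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, RandomizedSearchCV
from sklearn.pipeline import Pipeline

# Synthetic data, purely for illustration.
X, y = make_classification(n_samples=200, n_features=20, n_informative=5,
                           random_state=0)

pipe = Pipeline([
    ("feature", GenericUnivariateSelect(f_classif, mode="fpr", param=0.05)),
    ("logreg", LogisticRegression(solver="lbfgs", max_iter=1000)),
])

# Treat the selection rule and the regularisation strength as hyperparameters.
# The candidate values below are arbitrary examples.
param_distributions = {
    "feature__mode": ["fpr", "fdr", "fwe"],
    "feature__param": [0.01, 0.05, 0.1],
    "logreg__C": [0.01, 0.1, 1.0, 10.0],
}

search = RandomizedSearchCV(
    pipe,
    param_distributions,
    n_iter=10,
    cv=KFold(n_splits=5, shuffle=True, random_state=1),
    random_state=0,
)
search.fit(X, y)
print(search.best_params_)
```

The step names ("feature", "logreg") are what make the double-underscore parameter names work, so they have to match the names used in the Pipeline.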

You can replace f_classif with your own function, as long as it takes (X, y) and returns a pair of arrays (scores, pvalues) with one entry per feature.
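
As a minimal sketch of such a custom score function: the name pearson_score and the use of a Pearson correlation are placeholders for whatever univariate test your modelling strategy actually needs, and the toy data is made up. With mode='fpr', the selector keeps every feature whose uncorrected p-value is below param.

```python
import numpy as np
from scipy import stats
from sklearn.feature_selection import GenericUnivariateSelect


def pearson_score(X, y):
    """Hypothetical score function: return (scores, pvalues), one per column."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    scores, pvalues = [], []
    for j in range(X.shape[1]):
        r, p = stats.pearsonr(X[:, j], y)
        scores.append(abs(r))  # larger score = stronger association
        pvalues.append(p)
    return np.array(scores), np.array(pvalues)


# Toy data: only the first column is related to the target.
rng = np.random.RandomState(0)
X = rng.normal(size=(200, 5))
y = 2.0 * X[:, 0] + rng.normal(scale=0.5, size=200)

# Keep features with uncorrected p < 0.05.
selector = GenericUnivariateSelect(pearson_score, mode="fpr", param=0.05)
selector.fit(X, y)
mask = selector.get_support()
print(mask)
```

Because the selector calls the score function inside fit, putting it in a Pipeline means the test is recomputed on the training rows of each split.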

0
votes

I did ultimately figure out a solution that won't win any style points but does work for my application. I held the splits constant and pre-calculated the significant univariate results in a different language (R). I then wrote a custom feature selection class that infers the index of the current split from the indices of the observations (rows) in the current training split (X), given the entire dataset (Xall). That index is then used to read the precalculated features for that particular split from a file.

class PrecalculatedSelector():
    """
    Univariate feature selection by reading pre-calculated results
    for each split. 
    """

    def __init__(self, cv, Xall, yall):
        self.cv = cv
        self.Xall = Xall
        self.yall = yall
        self.features = None

    def transform(self, X, y=None, **kwargs):
        return X[self.features]

    def fit(self, X, y=None):
        # infer split index from sample indices
        samples = set(X.index)  # set membership keeps the lookup fast
        sample_idxs = [idx for idx, item in enumerate(self.Xall.index)
                       if item in samples]
        # split files are numbered from 1
        split_idx = -1
        for counter, (train_idxs, test_idxs) in enumerate(
                self.cv.split(self.Xall, self.yall), start=1):
            if list(train_idxs) == sample_idxs:
                split_idx = counter
                break
        if split_idx == -1:
            raise ValueError("X does not match the training set of any split")

        # read univariate results from file
        feature_dir = ...
        feature_file = feature_dir + "/split-{}.csv".format(split_idx)
        with open(feature_file, 'r') as f:
            self.features = [line.strip() for line in f.readlines()]

        return self
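
For anyone who wants something closer to scikit-learn's conventions, here is a rough sketch of the same idea with a dictionary lookup instead of looping over the splits on every fit. LookupSelector and its train_index_to_features argument are hypothetical names, and the made-up feature lists stand in for the contents of the per-split files. It assumes X is a pandas DataFrame with unique row labels and that the splits are deterministic.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.model_selection import KFold


class LookupSelector(BaseEstimator, TransformerMixin):
    """Hypothetical variant of PrecalculatedSelector using a dict lookup."""

    def __init__(self, train_index_to_features):
        # Maps frozenset(training row labels) -> list of column names.
        self.train_index_to_features = train_index_to_features

    def fit(self, X, y=None):
        key = frozenset(X.index)
        if key not in self.train_index_to_features:
            raise ValueError("Training rows do not match any known split.")
        self.features_ = self.train_index_to_features[key]
        return self

    def transform(self, X):
        return X[self.features_]


Xall = pd.DataFrame({"a": range(6), "b": range(6, 12), "c": range(12, 18)},
                    index=["s{}".format(i) for i in range(6)])
cv = KFold(n_splits=3)

# Stand-in for the per-split files: made-up feature lists, one per split.
made_up = [["a"], ["a", "b"], ["c"]]
lookup = {
    frozenset(Xall.index[train_idxs]): feats
    for (train_idxs, _), feats in zip(cv.split(Xall), made_up)
}

selector = LookupSelector(lookup)
for train_idxs, test_idxs in cv.split(Xall):
    selector.fit(Xall.iloc[train_idxs])
    # prints ['a'], then ['a', 'b'], then ['c']
    print(list(selector.transform(Xall.iloc[test_idxs]).columns))
```

Inheriting from BaseEstimator gives the class get_params and set_params for free, and keying on the set of training row labels means fit never has to regenerate the splits.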