I would like to do a random search over a large hyperparameter grid. One of the hyperparameters I’d like to optimize is feature selection. scikit-learn provides some very useful functionality for this, like the RFECV class, but RFECV is not compatible with all models, since some don’t expose coef_ or feature_importances_ attributes. So I would like to compare RFECV with univariate feature selection. In particular, I would like to keep all features whose univariate association with my dependent variable is statistically significant at uncorrected p < 0.05. However, my modelling strategy for the data is fairly complex, so I cannot use one of the existing scikit-learn classes like SelectKBest or SelectFdr to apply a simple univariate statistical test. At the same time, I am wary of simply pre-calculating the significant univariate associations on the entire dataset, because the selection would then be informed by the test data in every split.
The easiest way to address this that I can see is to pre-calculate the significant univariate associations for the subset of the data in each cross-validation split, and then implement a custom feature selection function that reads these from a text file. I understand from this question that I can create a custom feature selection object that takes the cross-validation object in its constructor:
class ExternalSelector:
    """
    Univariate feature selection by reading pre-calculated results
    for each CV split.
    """
    def __init__(self, cv):
        self.cv = cv
        self.feature_subset = None

    def transform(self, X, y=None, **kwargs):
        # Assumes self.cv is an iterable of (train_idxs, test_idxs) pairs
        for split_idx, (train_idxs, test_idxs) in enumerate(self.cv):
            # read the file for split_idx
            # subset X
            pass

    def fit(self, X, y=None):
        return self

    def get_params(self, deep=True):
        # Constructor arguments, as scikit-learn expects when cloning
        return {"cv": self.cv}
... but, having reviewed scikit-learn’s univariate feature selection source code, I can’t figure out how, or even whether, it is possible for transform to return a list of Xs, one for each split.
How can I implement a custom feature selection function that reads a different list of features for each cross-validation split?
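For what it's worth, one direction I have been considering is to skip passing the cross-validation object altogether and let fit work out which split it is in from the rows it receives. The sketch below is only that, a sketch under assumptions: it assumes X is a pandas DataFrame whose index uniquely labels the samples, and that the pre-calculated results have already been loaded into a dict keyed by the set of training row labels. LookupSelector, split_key and the toy "significant" feature lists are hypothetical names and data I made up for illustration; only the scikit-learn and pandas calls are real.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline


def split_key(row_labels):
    """Order-independent key identifying a set of training rows."""
    return frozenset(row_labels)


class LookupSelector(TransformerMixin, BaseEstimator):
    """Hypothetical selector: keeps the columns that were pre-calculated
    for exactly the training rows it is fitted on."""

    def __init__(self, lookup=None):
        # dict mapping frozenset of row labels -> list of column names
        self.lookup = lookup

    def fit(self, X, y=None):
        key = split_key(X.index)
        if key not in self.lookup:
            raise KeyError("no pre-calculated features for this training set")
        self.features_ = list(self.lookup[key])
        return self

    def transform(self, X):
        # Apply the selection learned in fit to training or test rows alike
        return X[self.features_]


# Toy data standing in for the real problem
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(12, 4)), columns=["a", "b", "c", "d"])
y = rng.normal(size=12)
cv = KFold(n_splits=3, shuffle=True, random_state=0)

# Stand-in for the external analysis: in practice these lists would be
# read from the text file produced for each split's training rows.
lookup = {}
for i, (train_idx, _) in enumerate(cv.split(X)):
    significant = ["a", "b"] if i % 2 == 0 else ["c"]
    lookup[split_key(X.index[train_idx])] = significant

# Check which columns are kept in each split
selected_per_split = []
for train_idx, test_idx in cv.split(X):
    selector = LookupSelector(lookup).fit(X.iloc[train_idx])
    selected_per_split.append(list(selector.transform(X.iloc[test_idx]).columns))

# The same selector used as a pipeline step under cross-validation
pipeline = make_pipeline(LookupSelector(lookup), LinearRegression())
scores = cross_val_score(pipeline, X, y, cv=cv)
```

If something like this is viable, the same cv object (with a fixed random_state) would have to be used both for the pre-calculation and for the search, and a search with refit=True would also need an entry for the full training set, since the final refit is fitted on all rows. I don't know whether this is a sensible pattern or whether there is a more standard way.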