1
votes

I am trying to predict y, a column of 0s and 1s (classification), using features (X). I'm using ML models like XGBoost.

One of my features, call it X1, is in reality highly predictive. X1 is a column of -1/0/1 values. When X1 = 1, 80% of the time y = 1. When X1 = -1, 80% of the time y = 0. When X1 = 0, it has no correlation with y.
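For concreteness, here is a small simulation of the kind of relationship I'm describing (the 80% and 5% numbers come from above; everything else, including the sample size and seed, is just for illustration):

```python
# Illustrative simulation of the X1 / y relationship described above.
# The 80% and 5% figures come from the question; the rest is assumed.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# X1 is -1 or 1 about 5% of the time, 0 the remaining 95%
x1 = rng.choice([-1, 0, 1], size=n, p=[0.025, 0.95, 0.025])

# y agrees with the sign of X1 80% of the time when X1 != 0,
# and is an independent coin flip when X1 == 0
y = np.where(
    x1 == 0,
    rng.integers(0, 2, size=n),
    np.where(rng.random(n) < 0.8, (x1 == 1).astype(int), (x1 == -1).astype(int)),
)

mask = x1 != 0
print(mask.mean())                          # ~0.05: X1 is rarely non-zero
print((y[mask] == (x1[mask] == 1)).mean())  # ~0.80: but highly predictive when it is
```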

So in reality, ML aside, any sane person would select this feature for their model, because if you see X1 = 1 or X1 = -1 you have an 80% chance of correctly predicting whether y is 0 or 1.

However, X1 is only -1 or 1 about 5% of the time, and is 0 the other 95% of the time. When I run it through feature selection techniques like Sequential Feature Selection, it doesn't get chosen! And I can understand why ML doesn't choose it: 95% of the time it is 0 (and thus uncorrelated with y), so for any score that I've come across, models with X1 don't score well.
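Roughly the kind of selection I'm running (a sketch only: the estimator, number of features to select and CV settings below are placeholders, not my actual pipeline, and the toy noise features don't reproduce the stronger competing predictors in my real data):

```python
# Sketch of the sequential selection step (placeholder estimator/settings).
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import SequentialFeatureSelector

rng = np.random.default_rng(0)
n = 10_000

# Same toy data-generating process as above, plus a few pure-noise features
x1 = rng.choice([-1, 0, 1], size=n, p=[0.025, 0.95, 0.025])
y = np.where(
    x1 == 0,
    rng.integers(0, 2, size=n),
    np.where(rng.random(n) < 0.8, (x1 == 1).astype(int), (x1 == -1).astype(int)),
)
X = np.column_stack([x1, rng.normal(size=(n, 4))])  # X1 is column 0

selector = SequentialFeatureSelector(
    GradientBoostingClassifier(),     # stand-in for the XGBoost model
    n_features_to_select=2,           # placeholder
    direction="forward",
    scoring="accuracy",               # overall accuracy barely moves when X1 enters
    cv=3,
)
selector.fit(X, y)
print(selector.get_support())         # does column 0 (X1) survive the selection?
```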

So my question is, more generically: how can one deal with this paradox between ML technique and real-life logic? What can I do differently in ML feature selection/modelling to take advantage of the information embedded in X1's -1s and 1s, which I know (in reality) are highly predictive? What feature selection technique would have spotted the predictive power of X1, if we didn't know anything about it? So far, all methods that I know of need predictive power to be unconditional. Instead, here X1 is highly predictive conditional on not being 0 (which is only 5% of the time). What methods are out there to capture this?

Many thanks for any insight!

The way I see it, the information gained through this feature is as compressed as it can be. If it doesn't seem to be a relevant predictor according to the feature selection technique used, then ignore it. If you know it is relevant, then use it. Perhaps the metrics used in the feature selection technique disregard its predictive power. – yatu
Well, in the example I've given, it is clear that X1 has predictive power. I don't think that can be disputed. If you see X1 = 1 or X1 = -1 you have an 80% chance of a successful prediction, even though that happens only 5% of the time. So the question is, how can an automated feature selection approach utilise this? There must be a way (maybe a different score, a transformation of X1, or a different approach to feature selection). To just manually select it doesn't answer the more general issue at hand. I don't think overriding the feature selection process is the answer, to be honest. – pp92391
I mean, until what point do you need to automate this pipeline of yours? If it is for this particular case only, I think that wouldn't be so bad. The thing is that you know of the predictive power of the feature. I'd suggest you try different feature selection techniques and see what feature importances you get, then keep the one that best suits your expectations. – yatu
Right, so that is my question, I guess. What feature selection technique would have spotted the predictive power of X1? Imagine I didn't know X1 was predictive. What could I have done differently to spot that X1 does in fact have predictive power? So far, all methods I know of need this predictive power to be unconditional. Instead, X1 is predictive conditional on not being 0 (which is only 5% of the time). What methods are out there to capture this? That is my more general question. – pp92391

1 Answer

1
votes

Probably sklearn.feature_selection.RFE would be a good option, since it is not really dependent on a separate feature-scoring method. What I mean by that is that it recursively fits the estimator you're planning to use on smaller and smaller subsets of features, removing the features with the lowest scores at each step until the desired number of features is reached.
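A minimal sketch of what that could look like (the estimator, dataset and numbers below are placeholders; in your case you would plug in your XGBoost model and your own X and y):

```python
# Minimal RFE sketch with a tree-based estimator (all names/numbers are
# placeholders; substitute your own XGBoost model and feature matrix).
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import RFE

X, y = make_classification(n_samples=5_000, n_features=10,
                           n_informative=4, random_state=0)

rfe = RFE(
    estimator=GradientBoostingClassifier(),  # any estimator with feature_importances_ or coef_
    n_features_to_select=5,                  # how many features to keep (placeholder)
    step=1,                                  # drop one feature per iteration
)
rfe.fit(X, y)

print(rfe.support_)   # boolean mask of the selected features
print(rfe.ranking_)   # 1 = selected; larger numbers were eliminated earlier
```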

This seems like a good approach, since regardless of whether the feature in question seems more or less of a good predictor to you, this feature selection method tells you how important the feature is to the model. So if a feature is not selected, it is not as relevant to the model in question.