I am trying to predict y, a column of 0s and 1s (classification), using features (X). I'm using ML models like XGBoost.
One of my features, let's call it X1, is in reality highly predictive. X1 is a column of -1/0/1 values. When X1 = 1, y = 1 80% of the time; when X1 = -1, y = 0 80% of the time; when X1 = 0, X1 carries no information about y.
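For concreteness, here's a minimal simulation of this setup in Python (the sample size and exact probabilities are illustrative, not my real data):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# X1 is 0 about 95% of the time, and -1 or +1 the remaining ~5%.
x1 = rng.choice([-1, 0, 1], size=n, p=[0.025, 0.95, 0.025])

# y agrees with the sign of X1 80% of the time when X1 != 0,
# and is a coin flip when X1 == 0.
p_y1 = np.where(x1 == 1, 0.8, np.where(x1 == -1, 0.2, 0.5))
y = rng.binomial(1, p_y1)

# Sanity check: P(y = 1 | X1 = v) for each value v.
for v in (-1, 0, 1):
    print(v, round(y[x1 == v].mean(), 3))
```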
So in reality, ML aside, any sane person would select this feature for their model, because whenever you see X1 = 1 or X1 = -1 you have an 80% chance of correctly predicting whether y is 0 or 1.
However, X1 is -1 or 1 only about 5% of the time and 0 the other 95%. When I run it through feature-selection techniques like Sequential Feature Selection, it doesn't get chosen! I can understand why: 95% of the time it is 0 (and thus uncorrelated with y), so under every selection score I've come across, models that include X1 don't score noticeably better than models that leave it out.
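To put a rough number on "don't score noticeably better": assuming y is roughly 50/50 overall (an assumption for this back-of-the-envelope), the most X1 can add to overall accuracy is about 1.5 percentage points, because it only helps on the 5% of rows where it is non-zero:

```python
# Back-of-the-envelope bound on how much X1 alone can move an accuracy-style score.
# Assumes y is roughly balanced, so a model ignoring X1 is near a coin flip
# on the rows where X1 != 0.
p_nonzero = 0.05       # fraction of rows with X1 in {-1, +1}
acc_without_x1 = 0.50  # ~coin flip on those rows without X1
acc_with_x1 = 0.80     # accuracy on those rows when using X1
overall_lift = p_nonzero * (acc_with_x1 - acc_without_x1)
print(overall_lift)    # 0.015 -> ~1.5 percentage points overall
```

A lift that small is easily swamped by cross-validation noise, which I suspect is exactly why the selector skips X1.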
So my question is, more generally: how can one deal with this apparent paradox between ML technique and real-life logic? What can I do differently in ML feature selection/modelling to take advantage of the information embedded in the X1 -1's and 1's, which I know (in reality) are highly predictive? What feature-selection technique would have spotted the predictive power of X1 if we didn't know anything about it in advance? All the methods I know of need the predictive power to be unconditional (marginal), whereas here X1 is highly predictive only conditional on being non-zero, which happens just 5% of the time. What methods are out there to capture this?
Many thanks for any insight!