
I'm trying to predict a variable y from a set of features X, where X initially has 36 features. I have two questions about this:

  1. How do I handle boolean attributes (0, 1) when creating polynomial features? It doesn't make sense to square them, for example.

Code I have so far:

from sklearn.preprocessing import PolynomialFeatures

poly = PolynomialFeatures(degree=2)
X_ = poly.fit_transform(X_train)
  2. How do I do feature selection for polynomial regression? Creating polynomial features of degree 2 for 36 variables increases the size of X drastically. Is there a method that runs a selection and returns the best model based on MSE, for example?

1 Answer

  1. True, there is no point in squaring boolean features. One solution is to use PolynomialFeatures with the option interaction_only=True, so you only get their products. The product of two booleans is actually an AND. You could also write your own function to get other combinations such as OR or XOR.

  2. Depending on the number of original features, an exhaustive search over all possible feature combinations may or may not be feasible; with 36 original features (roughly 700 after the degree-2 expansion) it almost certainly is not. Then you could:

a) use LASSO regression (or elastic net) that automatically performs variable selection
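For option (a), a sketch using LassoCV on the expanded features. X_train and y_train are assumed from the question; synthetic stand-in data is generated here so the example runs on its own:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 5))  # stand-in for the real X_train
y_train = X_train[:, 0] * X_train[:, 1] + rng.normal(scale=0.1, size=200)

X_poly = PolynomialFeatures(degree=2, include_bias=False).fit_transform(X_train)

# LassoCV picks the regularisation strength by cross-validation;
# coefficients driven to exactly zero are effectively deselected features.
lasso = LassoCV(cv=5).fit(X_poly, y_train)
selected = np.flatnonzero(lasso.coef_)
print(f"{len(selected)} of {X_poly.shape[1]} expanded features kept")
```

ElasticNetCV works the same way if you want the elastic-net variant.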

b) try tree-based methods for the same reason (e.g. random forest)
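For option (b), a sketch of ranking features with a random forest's impurity-based importances (again on synthetic stand-in data, with only one informative column):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X_train = rng.normal(size=(300, 6))
y_train = 3 * X_train[:, 2] + rng.normal(scale=0.1, size=300)  # only column 2 matters

# Trees capture interactions and non-linearities directly, so with this
# approach the polynomial expansion step is often unnecessary.
forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)
ranking = np.argsort(forest.feature_importances_)[::-1]
print("features ranked by importance:", ranking)
```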

c) try univariate feature selection methods (e.g. chi-square for a classification target with non-negative features, or f_regression for a continuous target like yours)
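For option (c), a sketch with SelectKBest; since y here is continuous, f_regression is used in place of chi-square (which requires a classification target and non-negative features). Data is again a synthetic stand-in:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.feature_selection import SelectKBest, f_regression

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 5))
y_train = X_train[:, 0] * X_train[:, 1] + rng.normal(scale=0.1, size=200)

X_poly = PolynomialFeatures(degree=2, include_bias=False).fit_transform(X_train)

# Keep the 10 expanded features with the strongest univariate F-score.
selector = SelectKBest(score_func=f_regression, k=10)
X_selected = selector.fit_transform(X_poly, y_train)
print(X_selected.shape)  # (200, 10)
```

Note that SelectKBest scores each feature independently, so it can miss features that only matter jointly; to answer the "best model based on MSE" part of the question, you could put the expansion, the selector, and a regressor in a Pipeline and tune k with GridSearchCV using scoring='neg_mean_squared_error'.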