
I'm trying to understand how to use sklearn's RFE for a linear regression model when I have categorical columns that I created with pandas get_dummies.

I have a dataset and the layout is:

y = ElantraSales
X = Unemployment, Queries, CPI_energy, CPI_all, Month (comes in as an int)

The first thing I do is convert Month to object and then to category (converting straight to the category type wasn't working for me in pandas):

df['MonthFac'] = df['Month'].astype('object')
df['MonthFac'] = df['MonthFac'].astype('category')
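As a side note, in recent pandas versions the one-step conversion usually does work. A minimal sketch on a made-up toy frame:

```python
import pandas as pd

# Hypothetical toy frame standing in for the real dataset
df = pd.DataFrame({'Month': [1, 2, 3, 1, 2, 3]})

# Converting the int column straight to category in one step
df['MonthFac'] = df['Month'].astype('category')
print(df['MonthFac'].dtype)  # category
```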

Then I create my X, y:

from sklearn.linear_model import LinearRegression
cols = ['Unemployment','Queries','CPI_energy','CPI_all']
X = pd.concat([train[cols],(pd.get_dummies(train['MonthFac']))], axis = 1)
y = train['ElantraSales'].values

lm1 = LinearRegression()

lm1.fit(X,y)
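For reference, here is a self-contained sketch of those same steps on synthetic data (the values are made up; only the column names come from the question). get_dummies expands the categorical into one 0/1 column per month, so X ends up with the 4 numeric columns plus one column per month category:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy stand-in for the training frame (4 made-up rows, 4 made-up months)
train = pd.DataFrame({
    'Unemployment': [5.0, 5.2, 4.9, 5.1],
    'Queries': [100, 120, 90, 110],
    'CPI_energy': [200.0, 201.0, 199.5, 202.0],
    'CPI_all': [230.0, 231.0, 229.0, 232.0],
    'MonthFac': pd.Categorical(['Jan', 'Feb', 'Mar', 'Apr']),
    'ElantraSales': [14000, 15000, 13500, 15500],
})

cols = ['Unemployment', 'Queries', 'CPI_energy', 'CPI_all']
# One dummy column per month category is appended to the numeric columns
X = pd.concat([train[cols], pd.get_dummies(train['MonthFac'])], axis=1)
y = train['ElantraSales'].values

lm1 = LinearRegression().fit(X, y)
print(X.shape)  # 4 rows, 4 numeric + 4 dummy columns
```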

Then I want to use RFE:

from sklearn.feature_selection import RFE

selector = RFE(lm1,step=1, n_features_to_select = 2)
selector.fit(X,y)
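To see which variables are top ranked by name rather than by position, note that selector.ranking_ lines up with X's columns, so you can label it directly. A sketch on synthetic data (the column names are assumed; 'Jan' stands in for one month dummy):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import RFE

# Synthetic stand-in for X/y: only Queries and CPI_all actually drive y
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(50, 5)),
                 columns=['Unemployment', 'Queries', 'CPI_energy', 'CPI_all', 'Jan'])
y = 3 * X['Queries'] - 2 * X['CPI_all'] + rng.normal(scale=0.1, size=50)

selector = RFE(LinearRegression(), step=1, n_features_to_select=2).fit(X, y)

# Pair each column name with its RFE rank; rank 1 means "selected"
ranks = pd.Series(selector.ranking_, index=X.columns).sort_values()
print(ranks)
print(list(X.columns[selector.support_]))
```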

A simple RFE looking for 2 features. However, the result is that it ranks 2 of the month columns as 1. Technically, if even one of the month columns were ranked 1, I would conclude that the MonthFac variable is significant for building my model; what I really want to know is which other variable is top ranked.

Or am I just supposed to use deductive reasoning to figure out which other variable to use based on the selector.ranking_ output?

Compared to R, sklearn's learning curve seems a lot steeper.

Also, am I handling categorical values right in pandas/sklearn? In R, all I had to do was as.factor and BAM, it did all of this.

One more question: if I wasn't sure what the optimum number of features was, I would think I could create a loop over the number of selected features, compute R^2 / adjusted R^2 / MSE for each, and print them out. But since I have those additional month columns, would my loop go to 16, because there are essentially 16 features? Is there a better way to do this?
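Rather than hand-rolling that loop, scikit-learn's RFECV runs RFE for every feature count and picks the best one by cross-validated score. A sketch on synthetic data with 16 made-up features, only 3 of which are informative:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import RFECV

# 16 synthetic features; only f0, f1, f2 actually drive y
rng = np.random.default_rng(1)
X = pd.DataFrame(rng.normal(size=(100, 16)),
                 columns=[f'f{i}' for i in range(16)])
y = 2 * X['f0'] - 3 * X['f1'] + X['f2'] + rng.normal(scale=0.5, size=100)

# RFECV tries every feature count and keeps the one with the best CV R^2
selector = RFECV(LinearRegression(), step=1, cv=5, scoring='r2').fit(X, y)
print(selector.n_features_, list(X.columns[selector.support_]))
```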


1 Answer


For the first part of your question, each dummy variable is considered a distinct feature. Take your 12 month dummies as an example: get_dummies produces one 0/1 column per month (or 11 columns if you drop one as a baseline; taking January as the baseline, the coefficients of the other 11 dummies tell you whether a particular month has a different mean level than January). So it makes perfect sense that RFE selects two month dummy features for you.
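To make the 12-vs-11 column point concrete, here is a small sketch with made-up month labels; drop_first=True drops the first category so it becomes the baseline:

```python
import pandas as pd

# 12 hypothetical month labels as a categorical
months = pd.Categorical([f'M{i:02d}' for i in range(1, 13)])

full = pd.get_dummies(months)                    # one column per month
bench = pd.get_dummies(months, drop_first=True)  # first month is the baseline
print(full.shape[1], bench.shape[1])
```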

However, since you use a default LinearRegression in RFE, and RFE uses .coef_ to rank feature importance, you should put the features on a comparable scale first, e.g. LinearRegression(normalize=True). Otherwise, selecting features based on raw linear-regression coefficients is meaningless, because the coefficient magnitudes depend on the units of each feature.
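Note that normalize=True has since been removed from LinearRegression (scikit-learn 1.2); scaling the design matrix yourself with a StandardScaler achieves the same effect for ranking. A sketch with two made-up features on very different scales, where only the first one actually matters more:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import RFE

# Two synthetic features on wildly different scales (assumed for illustration)
rng = np.random.default_rng(2)
X = np.column_stack([rng.normal(scale=1000, size=80),   # large-scale feature
                     rng.normal(scale=0.01, size=80)])  # tiny-scale feature
# Feature 0 contributes twice as much variance to y as feature 1
y = 0.002 * X[:, 0] + 100 * X[:, 1] + rng.normal(scale=0.1, size=80)

# Standardize first, so coef_ magnitudes are comparable across features
X_scaled = StandardScaler().fit_transform(X)
selector = RFE(LinearRegression(), n_features_to_select=1, step=1).fit(X_scaled, y)
print(selector.support_)
```

Without the scaling step, the raw coefficients would be 0.002 and 100, and RFE would wrongly drop the more important feature.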