I'm trying to understand how to use sklearn's RFE for a linear regression model when I have categorical columns that I created with pandas get_dummies.
I have a dataset and the layout is:
y = ElantraSales (car sales)
X = Unemployment, Queries, CPI_energy, CPI_all, Month (comes in as an int)
The first thing I do is convert Month to object and then to category (converting straight to the category type wasn't working for me in pandas):
df['MonthFac'] = df['Month'].astype('object')
df['MonthFac'] = df['MonthFac'].astype('category')
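As an aside, I'd expect the direct one-step conversion below to work in recent pandas, so the two-step dance may just be a version quirk on my end:

df['MonthFac'] = df['Month'].astype('category')  # one-step int -> category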
Then I create my X, y:
import pandas as pd
from sklearn.linear_model import LinearRegression

cols = ['Unemployment', 'Queries', 'CPI_energy', 'CPI_all']
X = pd.concat([train[cols], pd.get_dummies(train['MonthFac'])], axis=1)
y = train['ElantraSales'].values
lm1 = LinearRegression()
lm1.fit(X, y)
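As a sanity check of what get_dummies produced (the dummy column names are just the 12 category levels of MonthFac), I can inspect the design matrix:

# 4 numeric columns + 12 month dummies = 16 features in total
print(X.shape)
print(X.columns.tolist())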
Then I want to use RFE:
from sklearn.feature_selection import RFE
selector = RFE(lm1, step=1, n_features_to_select=2)
selector.fit(X,y)
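To see which columns got which rank, I can line selector.ranking_ up with the column names (ranking_ and support_ are the standard fitted RFE attributes; rank 1 means selected):

# pair each column with its RFE rank and sort
ranks = pd.Series(selector.ranking_, index=X.columns).sort_values()
print(ranks)
# the selected columns, via the boolean support mask
print(X.columns[selector.support_].tolist())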
This is a simple RFE looking for 2 features; however, the result is that it ranks 2 of the month dummy columns as 1. The way I see it, if even one of the month columns gets rank 1, I would conclude that the 'MonthFac' variable is significant for building my model, and then I'd want to know which other variable is the top-ranked one to use. Or am I just supposed to apply my own deductive reasoning to figure out which other variable to use, based on the selector.ranking_ output?
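What I'm effectively doing by hand is collapsing the dummy ranks back onto the parent variable; a rough sketch of that idea (month_cols is just my name for the set of dummy column labels from get_dummies):

ranks = pd.Series(selector.ranking_, index=X.columns)
month_cols = pd.get_dummies(train['MonthFac']).columns
# if any month dummy achieved rank 1, treat MonthFac itself as selected
month_selected = (ranks[month_cols] == 1).any()
# the other top-ranked features that are not month dummies
other_top = ranks[(ranks == 1) & ~ranks.index.isin(month_cols)]
print(month_selected, other_top.index.tolist())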
Compared to R, sklearn's learning curve seems a lot steeper.
Also, am I handling categorical values correctly in pandas/sklearn? In R, all I had to do was as.factor and BAM, it did all of this.
One more question: if I weren't sure of the optimal number of features, I figure I could write a loop over the selector, computing and printing R^2 / adjusted R^2 / MSE for each feature count. But since I have those additional month columns, would my loop have to go up to 16, since there are essentially 16 features? Is there a better way to do this?
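Here's a rough sketch of the loop I have in mind, fitting RFE at each feature count and scoring in-sample R^2, adjusted R^2 (computed as 1 - (1 - R^2)(n - 1)/(n - k - 1)), and MSE; the bound of 16 is just X.shape[1]:

from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

n = X.shape[0]
for k in range(1, X.shape[1] + 1):  # 1 .. 16 features
    sel = RFE(LinearRegression(), n_features_to_select=k, step=1).fit(X, y)
    pred = sel.predict(X)  # RFE delegates predict to the fitted inner estimator
    r2 = r2_score(y, pred)
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)
    mse = mean_squared_error(y, pred)
    print(k, round(r2, 3), round(adj_r2, 3), round(mse, 1))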