2 votes

My dataset has 32 categorical variables and one continuous numerical variable (sales_volume).

First I transformed the categorical variables to binary with one-hot encoding (pd.get_dummies), and now I have 1294 columns, since each categorical variable has several levels.
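For reference, that step looked roughly like this (df stands for my original frame; the name is just a placeholder):

import pandas as pd

# df: the original frame with 32 categorical columns plus sales_volume
X = pd.get_dummies(df.drop(columns=["sales_volume"]))
X.shape  # (n_rows, 1294): one column per category level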

Now I want to reduce them before applying any dimensionality reduction techniques.

  1. What is the best option for selecting the most informative variables?

  2. For example, one categorical variable has two answers, 'yes' and 'no'. Is it possible that the 'yes' column has significant importance while the 'no' column explains nothing? Would you drop the whole question (both the 'yes' and 'no' columns) or just the 'no' column?

Thanks in advance.

What are you trying to predict using these 32 binary variables + 1 continuous - are you predicting a binary outcome too? - MEdwin
I won't build a prediction model; in the end I'll build a clustering model, probably with K-means. But the one continuous variable can be used for feature importance, since it is the sales volume, and the binary variables are features of the points of sale (such as location, consumer type, etc.). - Tyr
Aha... I get you. Clustering is unsupervised, so you should look into Principal Feature Analysis. - MEdwin
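(For reference, Principal Feature Analysis can be sketched with sklearn building blocks: represent each feature by its loadings on the top principal components, cluster the loadings with K-means, and keep one representative feature per cluster. The function below is an illustrative sketch, not a library API; the name and parameters are assumptions.)

import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

def principal_feature_analysis(X, n_keep, n_components):
    # Represent each feature by its loadings on the top principal components
    loadings = PCA(n_components=n_components).fit(X).components_.T
    # Group similar features, then keep the one closest to each cluster center
    km = KMeans(n_clusters=n_keep, n_init=10).fit(loadings)
    selected = []
    for c in range(n_keep):
        members = np.where(km.labels_ == c)[0]
        dists = np.linalg.norm(loadings[members] - km.cluster_centers_[c], axis=1)
        selected.append(members[np.argmin(dists)])
    return sorted(selected)  # indices of the representative columns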

1 Answer

1 vote

In sklearn you can use sklearn.feature_selection.SelectFromModel, which fits a model on all your features and keeps only those whose importance in that model is above a threshold - for example a RandomForest. The get_support() method then returns a mask of the selected features.

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# X: the one-hot encoded feature matrix, y: the target variable
clf = RandomForestClassifier()
sfm = SelectFromModel(clf)
sfm.fit(X, y)

# Boolean mask marking the features whose importance passed the threshold
sfm.get_support()
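Since the target here (sales_volume) is continuous, a RandomForestRegressor is the natural drop-in. A sketch, assuming X is the dummy-encoded DataFrame and y the sales column:

from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel

sfm = SelectFromModel(RandomForestRegressor(n_estimators=100))
sfm.fit(X, y)

# Recover the names of the retained dummy columns and the reduced matrix
selected_columns = X.columns[sfm.get_support()]
X_reduced = sfm.transform(X)

Note that the mask operates on each dummy column independently, so it is entirely possible for the 'yes' column of a question to survive while its 'no' column is dropped.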