2
votes

I am a beginner in machine learning. I am doing a binary classification based on 49 features. The first 7 features are of float64 type, the next 18 are multiclass (categorical), and the rest are binary, i.e. 0 or 1. I performed feature selection using the following code:

from sklearn.ensemble import ExtraTreesClassifier

model = ExtraTreesClassifier()
model.fit(x_new, y)
print(model.feature_importances_)  # one importance score per feature

The output of the above was

[  1.20621145e-01   3.71627370e-02   1.82239903e-05   5.40071522e-03
   1.77431957e-02   8.40569119e-02   1.74562937e-01   5.00468692e-02
   7.60565780e-03   1.78975490e-01   4.30178009e-03   7.44005584e-03
   3.46208406e-02   1.67869557e-03   2.94863800e-02   1.97333741e-02
   2.53116233e-02   1.30663822e-02   1.14032351e-02   3.98503442e-02
   3.48701630e-02   1.93366039e-02   5.89310510e-03   3.17052801e-02
   1.47389909e-02   1.54041443e-02   4.94699885e-03   2.27428191e-03
   1.27218776e-03   7.39305898e-04   3.84357333e-03   1.59161363e-04
   1.31479740e-03   0.00000000e+00   5.24038196e-05   9.92543746e-05
   2.27356615e-04   0.00000000e+00   1.29338508e-05   4.98412036e-06
   2.97697346e-06   0.00000000e+00   0.00000000e+00   0.00000000e+00
   0.00000000e+00   1.49018368e-05   0.00000000e+00   0.00000000e+00
   0.00000000e+00]

Since none of them seemed significant, I tried it again on only the subset of 18 multiclass features, and the output was:

[ 0.06456545  0.01254671  0.32220959  0.00552464  0.02017919  0.07311639
  0.00716867  0.06964389  0.04797752  0.06608452  0.02915153  0.02044009
  0.05146265  0.05712569  0.09264365  0.01252251  0.01899865  0.02863864]

Including all features lowers the importance score of every individual feature, but it does not let me eliminate any of them. Should I eliminate the features with relatively lower scores? What is the correct inference from the results above?

I am using scikit-learn with Python 3.


3 Answers

2
votes

You can use sklearn.feature_selection.RFECV:

from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import RFECV

model = ExtraTreesClassifier()
model = RFECV(model, cv=3)  # recursive feature elimination with 3-fold cross-validation
model.fit(features_train, label_train)

This will automatically select the best subset of features by recursively eliminating the least important ones and scoring each candidate subset with cross-validation.

The fitted model has the following attributes:

n_features_: the number of features selected by cross-validation.

support_: the mask of selected features, a boolean array indexed by feature; selected features are True and eliminated ones are False.

ranking_: the feature ranking; selected features are assigned rank 1, and eliminated features receive higher ranks.

Refer: http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFECV.html#sklearn.feature_selection.RFECV
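A minimal sketch of how these attributes can be used after fitting, assuming the same model, features_train, and label_train as above (the variable names below are only illustrative):

import numpy as np

print(model.n_features_)   # number of features kept by cross-validation
print(model.support_)      # boolean mask: True = selected, False = eliminated
print(model.ranking_)      # 1 = selected, higher numbers = eliminated

selected_indices = np.where(model.support_)[0]             # column indices that were kept
features_train_reduced = model.transform(features_train)   # data restricted to the selected features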

1
votes

In sklearn, feature_importances_, also known as Gini importance, is computed as follows: for a given feature in a tree-based model, its importance is the total reduction in node impurity from splits on that feature, where each split's contribution is weighted by the proportion of samples reaching that node, averaged over all trees.

The values range from 0 to 1. A value of 0 means that the model's output does not depend on the feature at all, while a value close to 1 means that the output is strongly driven by that feature.

For feature selection, you can use SelectFromModel, which lets you specify a threshold; only the features with importance values above the threshold will be kept.
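A rough sketch of that, assuming the same x_new and y from the question (the threshold of 0.01 is only an illustration, not a recommendation):

from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel

model = ExtraTreesClassifier()
model.fit(x_new, y)

# keep only the features whose importance is above the threshold
selector = SelectFromModel(model, threshold=0.01, prefit=True)
x_selected = selector.transform(x_new)
print(x_selected.shape)   # (n_samples, n_selected_features)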

Check this answer for more details about how the feature importance is calculated.

1
votes

You are saying "none of them were significant", but the scores you are seeing from feature importance is not p-values. It is basically counting how useful a given feature was useful in splitting the data, and normalizing it, so all feature importances sum to 1.

You should compare the relative values across features. Consider using SelectFromModel to do the feature selection as part of a pipeline, as sketched below. http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectFromModel.html
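For example, something along these lines (a sketch only; the threshold='mean' setting and the final classifier are placeholders for whatever you actually use):

from sklearn.pipeline import Pipeline
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel

pipe = Pipeline([
    # drop features whose importance is below the mean importance
    ('select', SelectFromModel(ExtraTreesClassifier(), threshold='mean')),
    # fit the final classifier on the reduced feature set
    ('classify', ExtraTreesClassifier()),
])
pipe.fit(x_new, y)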