2
votes

I am a beginner in machine learning. I am doing a binary classification based on 49 features. The first 7 features are of float64 type, the next 18 are multiclass (categorical), and the rest are binary, i.e. 0 or 1. I performed feature selection using the following code:

from sklearn.ensemble import ExtraTreesClassifier

model = ExtraTreesClassifier()
model.fit(x_new, y)
print(model.feature_importances_)  # one importance score per feature

The output of the above was

[  1.20621145e-01   3.71627370e-02   1.82239903e-05   5.40071522e-03
   1.77431957e-02   8.40569119e-02   1.74562937e-01   5.00468692e-02
   7.60565780e-03   1.78975490e-01   4.30178009e-03   7.44005584e-03
   3.46208406e-02   1.67869557e-03   2.94863800e-02   1.97333741e-02
   2.53116233e-02   1.30663822e-02   1.14032351e-02   3.98503442e-02
   3.48701630e-02   1.93366039e-02   5.89310510e-03   3.17052801e-02
   1.47389909e-02   1.54041443e-02   4.94699885e-03   2.27428191e-03
   1.27218776e-03   7.39305898e-04   3.84357333e-03   1.59161363e-04
   1.31479740e-03   0.00000000e+00   5.24038196e-05   9.92543746e-05
   2.27356615e-04   0.00000000e+00   1.29338508e-05   4.98412036e-06
   2.97697346e-06   0.00000000e+00   0.00000000e+00   0.00000000e+00
   0.00000000e+00   1.49018368e-05   0.00000000e+00   0.00000000e+00
   0.00000000e+00]

Since none of them seemed significant, I tried it again on only the subset of 18 multiclass features, and the output was:

[ 0.06456545  0.01254671  0.32220959  0.00552464  0.02017919  0.07311639
  0.00716867  0.06964389  0.04797752  0.06608452  0.02915153  0.02044009
  0.05146265  0.05712569  0.09264365  0.01252251  0.01899865  0.02863864]

Including all features lowers the importance score of every individual feature, but it does not let me eliminate any of them. Should I eliminate the features with relatively lower scores? What is the correct inference from the results above?

I am using scikit-learn with Python 3.


3 Answers

2
votes

You can use sklearn.feature_selection.RFECV:

from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import RFECV

model = ExtraTreesClassifier()
model = RFECV(model, cv=3)  # recursive feature elimination with 3-fold cross-validation
model.fit(features_train, label_train)

This will automatically select the best subset of features by recursively eliminating the least important ones and scoring each candidate subset with cross-validation.

The fitted model has the following attributes:

n_features_: the number of features selected by cross-validation.

support_: the mask of selected features, a boolean array indexed by feature; selected features are True and eliminated ones are False.

ranking_: the feature ranking; selected features are assigned rank 1, and eliminated features receive higher ranks.

Refer: http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFECV.html#sklearn.feature_selection.RFECV
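A minimal sketch of how these attributes can be used after fitting, assuming the same model, features_train, and label_train as above (the variable names below are only illustrative):

import numpy as np

print(model.n_features_)   # number of features kept by cross-validation
print(model.support_)      # boolean mask: True = selected, False = eliminated
print(model.ranking_)      # 1 = selected, higher numbers = eliminated

selected_indices = np.where(model.support_)[0]             # column indices that were kept
features_train_reduced = model.transform(features_train)   # data restricted to the selected features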

1
votes

In sklearn, feature_importances_, also known as Gini importance, is computed as follows: for a given feature in a tree-based model, its importance is the total reduction in node impurity from splits on that feature, where each split's contribution is weighted by the proportion of samples reaching that node, averaged over all trees.

The values range from 0 to 1. A value of 0 means that the model's output does not depend on the feature at all, while a value close to 1 means that the output is strongly driven by that feature.

For feature selection, you can use SelectFromModel, which lets you specify a threshold; only the features with importance values above the threshold will be kept.
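A rough sketch of that, assuming the same x_new and y from the question (the threshold of 0.01 is only an illustration, not a recommendation):

from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel

model = ExtraTreesClassifier()
model.fit(x_new, y)

# keep only the features whose importance is above the threshold
selector = SelectFromModel(model, threshold=0.01, prefit=True)
x_selected = selector.transform(x_new)
print(x_selected.shape)   # (n_samples, n_selected_features)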

Check this answer for more details about how the feature importance is calculated.

1
votes

You are saying "none of them were significant", but the scores you are seeing from feature importance is not p-values. It is basically counting how useful a given feature was useful in splitting the data, and normalizing it, so all feature importances sum to 1.

You should compare the relative values across features. Consider using SelectFromModel to do the feature selection as part of a pipeline, as sketched below. http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectFromModel.html
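For example, something along these lines (a sketch only; the threshold='mean' setting and the final classifier are placeholders for whatever you actually use):

from sklearn.pipeline import Pipeline
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel

pipe = Pipeline([
    # drop features whose importance is below the mean importance
    ('select', SelectFromModel(ExtraTreesClassifier(), threshold='mean')),
    # fit the final classifier on the reduced feature set
    ('classify', ExtraTreesClassifier()),
])
pipe.fit(x_new, y)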