I have a classification dataset with 148 input features (20 of which are binary and the rest are continuous on the range [0,1]). The dataset has 66171 negative examples and only 71 positive examples.
The dataset (an `arff` text file) can be downloaded from this Dropbox link: https://dl.dropboxusercontent.com/u/26064635/SDataset.arff.
In the Weka suite, when I use `CfsSubsetEval` with `GreedyStepwise` (with `setSearchBackwards()` set to `true` and also to `false`), the selected feature set contains only 2 features (i.e. `79` and `140`)! It is probably needless to say that the classification performance with these two features is terribly bad.
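For context, my understanding is that CFS scores a candidate subset with a merit heuristic that rewards correlation with the class and penalizes correlation among the selected features. Below is a minimal Python sketch of that formula only. The function name `cfs_merit` and the correlation values are hypothetical and chosen for illustration; this is not Weka's API, and how Weka estimates the correlations is not shown.

```python
import math


def cfs_merit(k, mean_feature_class_corr, mean_feature_feature_corr):
    """CFS merit of a subset of k features.

    merit = k * r_cf / sqrt(k + k * (k - 1) * r_ff)
    where r_cf is the mean feature-class correlation and
    r_ff is the mean feature-feature correlation in the subset.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    numerator = k * mean_feature_class_corr
    denominator = math.sqrt(k + k * (k - 1) * mean_feature_feature_corr)
    return numerator / denominator


# Hypothetical correlation values, not measured on the dataset above.
single = cfs_merit(1, 0.3, 0.0)            # one feature on its own
redundant_pair = cfs_merit(2, 0.3, 1.0)    # second feature duplicates the first
independent_four = cfs_merit(4, 0.3, 0.0)  # four mutually uncorrelated features
print(single, redundant_pair, independent_four)
```

With these made-up numbers, adding a fully redundant feature leaves the merit unchanged, while adding uncorrelated features of the same quality raises it, so a greedy search stops growing the subset once new features no longer improve the score.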
Using `ConsistencySubsetEval` (in Weka as well) leads to the selection of ZERO features! When feature ranking methods are used instead and the best (e.g. 12) features are selected, much better classification performance is achieved.
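To make concrete what I mean by ranking: each feature is scored on its own against the class label and the top k are kept. Here is a minimal sketch in plain Python, assuming absolute Pearson correlation as the score. That criterion is only an assumption for illustration, and the helper names and the toy data are hypothetical, not taken from the dataset above.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences; 0.0 if either is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def top_k_features(rows, labels, k):
    """Return the indices of the k features with the largest |correlation| to the label."""
    n_features = len(rows[0])
    scores = []
    for j in range(n_features):
        column = [row[j] for row in rows]
        scores.append((abs(pearson(column, labels)), j))
    # Highest score first; ties are broken by the lower feature index.
    scores.sort(key=lambda pair: (-pair[0], pair[1]))
    return [j for _, j in scores[:k]]


# Toy data: 6 examples, 3 features (made up, not the real dataset).
rows = [
    [0.1, 0.9, 0.5],
    [0.2, 0.8, 0.5],
    [0.1, 0.7, 0.5],
    [0.9, 0.8, 0.5],
    [0.8, 0.9, 0.5],
    [0.9, 0.7, 0.5],
]
labels = [0, 0, 0, 1, 1, 1]
selected = top_k_features(rows, labels, k=2)
print(selected)
```

Unlike subset evaluation, this scores every feature independently, so the number of features kept is whatever k I choose rather than something the search decides.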
I have two questions:
First, what is it about the dataset that leads to the selection of so few features? Is it because of the imbalance between the number of positive and negative examples?
Second, and more importantly, are there any other subset selection methods (in Matlab or otherwise) that I can try and that may lead to the selection of more features?