
I have a classification dataset with 148 input features (20 of which are binary and the rest are continuous on the range [0,1]). The dataset has 66171 negative examples and only 71 positive examples.

The dataset (arff text file) can be downloaded from this dropbox link: https://dl.dropboxusercontent.com/u/26064635/SDataset.arff.

In the Weka suite, when I use CfsSubsetEval and GreedyStepwise (with setSearchBackwards() set to true and also false), the selected feature set contains only 2 features (i.e. 79 and 140)! It is probably needless to say that the classification performance with these two features is terribly bad.

Using ConsistencySubsetEval (in Weka as well) leads to the selection of ZERO features! When feature ranking methods are used instead and the best (e.g. 12) features are selected, a much better classification performance is achieved.

I have two questions:

First, what is it about the dataset that leads to the selection of so few features? Is it because of the imbalance between the number of positive and negative examples?

Second, and more importantly, are there any other subset selection methods (in Matlab or otherwise) that I can try and that may lead to the selection of more features?
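One family of alternatives worth trying is embedded selection, e.g. L1-regularized logistic regression, where the penalty zeroes out uninformative coefficients and you keep whichever features survive. Below is a minimal sketch in Python with scikit-learn (an assumption; the question mentions Weka/Matlab), using randomly generated stand-in data in place of the real ARFF file:

```python
# Sketch: embedded feature selection via L1-regularized logistic regression.
# X and y here are random stand-ins; in practice load them from the ARFF file.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.random((500, 148))                 # 148 features, as in the question
y = (rng.random(500) < 0.05).astype(int)   # rare positive class

# class_weight='balanced' compensates for the heavy class imbalance;
# the L1 penalty drives coefficients of uninformative features to zero.
clf = LogisticRegression(penalty="l1", solver="liblinear",
                         class_weight="balanced", C=0.5)
clf.fit(X, y)

selected = np.flatnonzero(clf.coef_[0])
print(f"{len(selected)} features selected: {selected[:12]}")
```

Tuning C up or down directly controls how many features survive, which gives you the knob the Weka subset evaluators don't expose.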


1 Answer


Clearly, the class imbalance is not helping. You could take a subsample of the dataset for better diagnostics. The SpreadSubsample filter lets you do that, stating the maximum admissible class imbalance, like 10:1, 3:1, or whatever you find appropriate.
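The effect of SpreadSubsample can be sketched outside Weka as well; a minimal NumPy version (stand-in data, assuming a 10:1 cap) just randomly discards majority-class rows:

```python
# Sketch of what Weka's SpreadSubsample does: cap the negative:positive
# ratio (here 10:1) by randomly discarding negative examples.
import numpy as np

rng = np.random.default_rng(42)
y = np.array([0] * 66171 + [1] * 71)   # class counts as in the question
X = rng.random((len(y), 148))          # stand-in feature matrix

max_ratio = 10                          # admissible imbalance, e.g. 10:1
pos_idx = np.flatnonzero(y == 1)
neg_idx = np.flatnonzero(y == 0)
keep_neg = rng.choice(neg_idx, size=max_ratio * len(pos_idx), replace=False)
keep = np.concatenate([pos_idx, keep_neg])

X_sub, y_sub = X[keep], y[keep]
print(X_sub.shape)   # (781, 148): 71 positives + 710 negatives
```

On the rebalanced subsample, subset evaluators like CfsSubsetEval have a much better chance of finding features that separate the minority class.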

For selection methods, you could try dimensionality reduction methods, like PCA, in WEKA, first.
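Note that PCA is dimensionality reduction rather than subset selection: it builds new components from all 148 features instead of discarding any. A minimal scikit-learn sketch (stand-in data) would be:

```python
# Sketch: PCA keeps enough principal components to explain a target
# fraction of variance, rather than picking a subset of raw features.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.random((1000, 148))      # stand-in for the real feature matrix

pca = PCA(n_components=0.95)     # retain 95% of the variance
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)
```

The trade-off is interpretability: each component mixes all original features, so you lose the "which 12 features matter" answer that ranking methods give.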

But if the algorithms are selecting those sets of features, they seem to be the most meaningful ones for your classification task.