
I am trying to use Weka to classify text. What I do is this:

  • I create one big ARFF file with all of the data: all_of_it.arff.
  • I split that data into a training set and a test set: train.arff and test.arff.
  • I do feature selection on the training set and output a new training file: train_fs.arff.
  • I build a classifier with only those selected features.
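Roughly, the splitting and feature-selection steps look like this from the command line (RemovePercentage for the split and the CfsSubsetEval/BestFirst pair for the selection are only example choices, not necessarily what matters here):

java -cp weka.jar weka.filters.unsupervised.instance.RemovePercentage -P 30 -i all_of_it.arff -o train.arff
java -cp weka.jar weka.filters.unsupervised.instance.RemovePercentage -P 30 -V -i all_of_it.arff -o test.arff
java -cp weka.jar weka.filters.supervised.attribute.AttributeSelection -E "weka.attributeSelection.CfsSubsetEval" -S "weka.attributeSelection.BestFirst" -c last -i train.arff -o train_fs.arff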

And the problem is.....

I don't quite know how to standardize the test set so that it uses only the features I selected from the training set, i.e. create a new test file from test.arff according to train_fs.arff.

I tried using

java -cp weka.jar weka.filters.unsupervised.attribute.Standardize -b -i train_fs.arff -o train2.arff -r test.arff -s test2.arff

but I got the infamous "Src and Dest differ in # of attributes" error.

Is there any way to normalize/standardize the sets according to an ARFF file (namely my new training data with fewer features)? I don't see how to do this with the Standardize or StringToWordVector filter.


2 Answers


Batch filtering is one solution to your problem.

Pros:

  • It applies the same filter to your test dataset as to your training dataset, so after feature selection the two datasets remain compatible

Cons:

  • It is only available from the command-line interface or Weka's Java API
  • The two datasets must be filtered at the same time

You can read more about Batch filtering here.
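For example, the attribute selection itself can be run as a batch filter, so that the selection learned on the training set is applied unchanged to the test set. Note that both -i and -r take the original, unfiltered files, which is why the two inputs have identical attributes and the "Src and Dest differ in # of attributes" error does not occur (CfsSubsetEval/BestFirst are only example choices):

java -cp weka.jar weka.filters.supervised.attribute.AttributeSelection -E "weka.attributeSelection.CfsSubsetEval" -S "weka.attributeSelection.BestFirst" -c last -b -i train.arff -o train_fs.arff -r test.arff -s test_fs.arff

The same idea in the Java API is to configure one filter object, initialise it on the training data, and pass that same object to Filter.useFilter for both datasets. A minimal sketch, assuming the same example evaluator and search:

import weka.attributeSelection.BestFirst;
import weka.attributeSelection.CfsSubsetEval;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.supervised.attribute.AttributeSelection;

public class BatchFeatureSelection {
    public static void main(String[] args) throws Exception {
        Instances train = DataSource.read("train.arff");
        Instances test  = DataSource.read("test.arff");
        train.setClassIndex(train.numAttributes() - 1);
        test.setClassIndex(test.numAttributes() - 1);

        // Feature selection as a filter; evaluator and search are example choices.
        AttributeSelection fs = new AttributeSelection();
        fs.setEvaluator(new CfsSubsetEval());
        fs.setSearch(new BestFirst());

        // Learn the attribute subset from the training data only...
        fs.setInputFormat(train);

        // ...then apply the SAME filter instance to both sets, so they end up
        // with identical attribute structures.
        Instances trainFs = Filter.useFilter(train, fs);
        Instances testFs  = Filter.useFilter(test, fs);
    }
}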


You may also want to look into InputMappedClassifier. It is a wrapper classifier that builds a mapping between the attributes of its training data and those of incoming test data (matching them by name), so it can handle training and test sets that are not compatible.
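A minimal sketch of how it could be used with the files from the question (NaiveBayes is just a placeholder base classifier):

import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.misc.InputMappedClassifier;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class InputMappedExample {
    public static void main(String[] args) throws Exception {
        // Reduced training set and the original, unreduced test set.
        Instances train = DataSource.read("train_fs.arff");
        Instances test  = DataSource.read("test.arff");
        train.setClassIndex(train.numAttributes() - 1);
        test.setClassIndex(test.numAttributes() - 1);

        // Wrap any base classifier; at prediction time the wrapper maps test
        // attributes onto the training attributes by name, so test attributes
        // that were not selected are simply ignored.
        InputMappedClassifier imc = new InputMappedClassifier();
        imc.setClassifier(new NaiveBayes());
        imc.buildClassifier(train);

        for (int i = 0; i < test.numInstances(); i++) {
            System.out.println(imc.classifyInstance(test.instance(i)));
        }
    }
}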