
I am trying to use Weka to classify text. What I do is this:

  • I create one big ARFF file with all of the data: all_of_it.arff.
  • I split that data into a training set and a test set: train.arff and test.arff.
  • I do feature selection on the training set and output a new training file: train_fs.arff.
  • I build a classifier with only those selected features.
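Roughly, the splitting and feature-selection steps look like this from the command line (RemovePercentage for the split and the CfsSubsetEval/BestFirst pair for the selection are only example choices, not necessarily what matters here):

java -cp weka.jar weka.filters.unsupervised.instance.RemovePercentage -P 30 -i all_of_it.arff -o train.arff
java -cp weka.jar weka.filters.unsupervised.instance.RemovePercentage -P 30 -V -i all_of_it.arff -o test.arff
java -cp weka.jar weka.filters.supervised.attribute.AttributeSelection -E "weka.attributeSelection.CfsSubsetEval" -S "weka.attributeSelection.BestFirst" -c last -i train.arff -o train_fs.arff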

And the problem is.....

I don't quite know how to standardize the test set so that it uses only the features I selected from the training set, i.e. create a new test file from test.arff according to train_fs.arff.

I tried using

java -cp weka.jar weka.filters.unsupervised.attribute.Standardize -b -i train_fs.arff -o train2.arff -r test.arff -s test2.arff

but I got the infamous "Src and Dest differ in # of attributes" error.

Is there any way to normalize/standardize the sets according to an ARFF file (namely my new training data with fewer features)? I don't see how to do this with the Standardize or StringToWordVector filter.


2 Answers


Batch filtering is one solution to your problem.

Pros:

  • It applies the same filter to your test dataset as to your training dataset, so after feature selection the two datasets remain compatible

Cons:

  • It is only available from the command-line interface or Weka's Java API
  • The two datasets must be filtered at the same time

You can read more about Batch filtering here.
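For example, the attribute selection itself can be run as a batch filter, so that the selection learned on the training set is applied unchanged to the test set. Note that both -i and -r take the original, unfiltered files, which is why the two inputs have identical attributes and the "Src and Dest differ in # of attributes" error does not occur (CfsSubsetEval/BestFirst are only example choices):

java -cp weka.jar weka.filters.supervised.attribute.AttributeSelection -E "weka.attributeSelection.CfsSubsetEval" -S "weka.attributeSelection.BestFirst" -c last -b -i train.arff -o train_fs.arff -r test.arff -s test_fs.arff

The same idea in the Java API is to configure one filter object, initialise it on the training data, and pass that same object to Filter.useFilter for both datasets. A minimal sketch, assuming the same example evaluator and search:

import weka.attributeSelection.BestFirst;
import weka.attributeSelection.CfsSubsetEval;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.supervised.attribute.AttributeSelection;

public class BatchFeatureSelection {
    public static void main(String[] args) throws Exception {
        Instances train = DataSource.read("train.arff");
        Instances test  = DataSource.read("test.arff");
        train.setClassIndex(train.numAttributes() - 1);
        test.setClassIndex(test.numAttributes() - 1);

        // Feature selection as a filter; evaluator and search are example choices.
        AttributeSelection fs = new AttributeSelection();
        fs.setEvaluator(new CfsSubsetEval());
        fs.setSearch(new BestFirst());

        // Learn the attribute subset from the training data only...
        fs.setInputFormat(train);

        // ...then apply the SAME filter instance to both sets, so they end up
        // with identical attribute structures.
        Instances trainFs = Filter.useFilter(train, fs);
        Instances testFs  = Filter.useFilter(test, fs);
    }
}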


You may also want to look into InputMappedClassifier. It is a wrapper classifier that builds a mapping between the attributes of its training data and those of incoming test data (matching them by name), so it can handle training and test sets that are not compatible.
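A minimal sketch of how it could be used with the files from the question (NaiveBayes is just a placeholder base classifier):

import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.misc.InputMappedClassifier;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class InputMappedExample {
    public static void main(String[] args) throws Exception {
        // Reduced training set and the original, unreduced test set.
        Instances train = DataSource.read("train_fs.arff");
        Instances test  = DataSource.read("test.arff");
        train.setClassIndex(train.numAttributes() - 1);
        test.setClassIndex(test.numAttributes() - 1);

        // Wrap any base classifier; at prediction time the wrapper maps test
        // attributes onto the training attributes by name, so test attributes
        // that were not selected are simply ignored.
        InputMappedClassifier imc = new InputMappedClassifier();
        imc.setClassifier(new NaiveBayes());
        imc.buildClassifier(train);

        for (int i = 0; i < test.numInstances(); i++) {
            System.out.println(imc.classifyInstance(test.instance(i)));
        }
    }
}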