I am trying to use Weka to classify text. What I do is this:
- I create one big ARFF file with all of the data: all_of_it.arff.
- I split that data into training and test sets: train.arff and test.arff.
- I do feature selection on the training set and output a new training file: train_fs.arff.
- I build a classifier with only those selected features.
And the problem is: I don't quite know how to standardize the test set so that it only uses the features I selected from the training set. Something like creating a new test file from test.arff according to train_fs.arff.
I tried using
java -cp weka.jar weka.filters.unsupervised.attribute.Standardize -b -i train_fs.arff -o train2.arff -r test.arff -s test2.arff
but I got the infamous "Src and Dest differ in # of attributes" error.
Is there any way to normalize/standardize the sets according to an ARFF file (namely my new training data with fewer features)? I don't see how to do this with the Standardize or StringToWordVector filter.
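To make concrete what I mean by creating a test file "according to" train_fs.arff, here is a rough Python sketch of the transformation I am after. It is not Weka code, and the function names parse_arff and project_onto are made up for illustration. It assumes dense (non-sparse) ARFF files with simple unquoted attribute names and values that contain no commas, which may well not hold for StringToWordVector output.

```python
def parse_arff(text):
    """Split dense ARFF text into (relation, [(name, type)], [rows]).

    Assumption: unquoted attribute names, no commas inside values,
    and no sparse-format rows.
    """
    relation, attributes, rows = "", [], []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):
            continue  # skip blank lines and comments
        lower = line.lower()
        if in_data:
            rows.append([v.strip() for v in line.split(",")])
        elif lower.startswith("@relation"):
            relation = line.split(None, 1)[1]
        elif lower.startswith("@attribute"):
            _, name, kind = line.split(None, 2)
            attributes.append((name, kind))
        elif lower.startswith("@data"):
            in_data = True
    return relation, attributes, rows


def project_onto(train_fs_text, test_text):
    """Rewrite the test ARFF so it has exactly the attributes of the
    feature-selected training file, in the same order."""
    _, keep, _ = parse_arff(train_fs_text)
    relation, test_attrs, rows = parse_arff(test_text)
    index = {name: i for i, (name, _) in enumerate(test_attrs)}
    # Raises KeyError if the test file lacks a selected attribute.
    cols = [index[name] for name, _ in keep]
    out = ["@relation " + relation, ""]
    out += ["@attribute {} {}".format(name, kind) for name, kind in keep]
    out += ["", "@data"]
    out += [",".join(row[i] for i in cols) for row in rows]
    return "\n".join(out) + "\n"
```

The idea would be to read train_fs.arff and test.arff as text, pass both to project_onto, and write the result out as the new test file, so that the training and test files end up with the same number of attributes.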