
I am using Weka to classify a dataset. Each data point is in one of five topics that I am trying to generalize across.

I would like to make each topic a test set so that I can train on topics 1-4 and test on topic 5, then train on topics 1, 3, 4 and 5, and test on 2, and so on.

Is there a way to direct Weka to perform this automatically in one run with one dataset? That is, can I direct Weka to cross-validate by topic?

I apologize for redundancy if this question has already been asked. If it indeed has, any help in directing me towards the answer would be most appreciated.

Thanks!


1 Answer


There are a few ways I can think of that may help you get the results you want:

  1. As you outlined in your question, you could generate five different training sets, each with the remaining topic as its test set. If you use the Weka interface, each model has to be trained individually (supply the training data, build a classifier, supply the test set, repeat). This is likely the quickest option if it's a one-off.

  2. You may be able to use the FilteredClassifier with the RemoveWithValues filter. This may be able to remove the training cases of a particular topic, provided the topic number is available as an attribute. (I am guessing the topic is not meant to be part of the model's input, so attribute filtering may also be required with this approach.)

  3. If you are willing to program a solution in Java, you can manipulate the data and build all five classifiers in one go. The algorithm I have in mind is outlined below. If you plan to repeat this process often, this may be the better solution.

Algorithm:

for each topic t
  training_data = all cases not in topic t
  testing_data = all cases in topic t
  build classifier using training_data
  evaluate classifier on testing_data
  save classifier and its results
end for
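
The loop above can be sketched in plain Java as follows. This is only a minimal illustration of the leave-one-topic-out split, not Weka code: the names `LeaveOneTopicOut`, `Case`, `Fold`, `split`, `trainMajority` and `accuracy` are all made up for this example, and a trivial majority-label baseline stands in for whatever Weka classifier you would actually train.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

class LeaveOneTopicOut {

    /** One labelled case (hypothetical type); topic is used only for splitting. */
    static final class Case {
        final int topic;
        final String label;
        Case(int topic, String label) { this.topic = topic; this.label = label; }
    }

    /** Train/test partition for a single held-out topic (hypothetical type). */
    static final class Fold {
        final int heldOutTopic;
        final List<Case> train = new ArrayList<>();
        final List<Case> test = new ArrayList<>();
        Fold(int heldOutTopic) { this.heldOutTopic = heldOutTopic; }
    }

    /** Builds one fold per distinct topic: test = that topic, train = all others. */
    static List<Fold> split(List<Case> data) {
        TreeSet<Integer> topics = new TreeSet<>();
        for (Case c : data) topics.add(c.topic);
        List<Fold> folds = new ArrayList<>();
        for (int t : topics) {
            Fold fold = new Fold(t);
            for (Case c : data) {
                if (c.topic == t) fold.test.add(c); else fold.train.add(c);
            }
            folds.add(fold);
        }
        return folds;
    }

    /** Placeholder "classifier" (assumption): the most frequent training label. */
    static String trainMajority(List<Case> train) {
        Map<String, Integer> counts = new HashMap<>();
        String best = null;
        for (Case c : train) {
            int n = counts.merge(c.label, 1, Integer::sum);
            if (best == null || n > counts.get(best)) best = c.label;
        }
        return best;
    }

    /** Fraction of test cases whose label equals the predicted label. */
    static double accuracy(String predicted, List<Case> test) {
        int hits = 0;
        for (Case c : test) if (c.label.equals(predicted)) hits++;
        return test.isEmpty() ? 0.0 : (double) hits / test.size();
    }

    public static void main(String[] args) {
        // Toy data: five topics with three labelled cases each.
        List<Case> data = new ArrayList<>();
        String[] labels = {"pos", "neg", "pos"};
        for (int topic = 1; topic <= 5; topic++)
            for (String label : labels) data.add(new Case(topic, label));

        // Train on four topics, evaluate on the held-out one, repeat.
        for (Fold f : split(data)) {
            String model = trainMajority(f.train);
            System.out.printf("held-out topic %d: train=%d test=%d accuracy=%.2f%n",
                    f.heldOutTopic, f.train.size(), f.test.size(), accuracy(model, f.test));
        }
    }
}
```

With Weka, the same structure should carry over: each fold's train and test lists would become two datasets, and the baseline would be replaced by a real classifier that is built on the training part and evaluated on the held-out topic. If the topic is stored as an attribute, it would probably need to be excluded from the features before training so the model cannot key on it.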