
I want to run a 10-fold cross validation train/test experiment using Weka on a dataset that is already divided into 10 folds (i.e. each instance is already allocated to a fold). I know that if you give Weka a complete dataset, it can create the folds for you and run 10-fold CV, but is there a way to tell it which instances should belong to which fold, other than manually splitting the dataset into 10 train/test sets and running each of the 10 experiments independently?

Thanks


1 Answer


Will you use Java? A simple example of training and evaluating classifiers using the Java interface is given here and excerpted below: http://www.programcreek.com/2013/01/a-simple-machine-learning-example-in-java/

The first step is creating the splits (crossValidationSplit is a helper method defined in the linked article):

// Do 10-split cross validation
Instances[][] split = crossValidationSplit(data, 10);

// Separate split into training and testing arrays
Instances[] trainingSplits = split[0];
Instances[] testingSplits = split[1];

Then do the usual training and evaluation (classify is also a helper from the linked article):

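// For each model, collect the predictions from all folds
for (int j = 0; j < models.length; j++) {
    FastVector predictions = new FastVector();

    // For each training-testing split pair, train and test the classifier
    for (int i = 0; i < trainingSplits.length; i++) {
        Evaluation validation = classify(models[j], trainingSplits[i], testingSplits[i]);

        predictions.appendElements(validation.predictions());
    }
}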

By contrast, some of my earlier code used the Instances.trainCV/testCV methods to get these instance sets. For your case, you could write a new function that returns the corresponding subsets of the data for your known folds.
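As a rough illustration of such a function, here is a minimal sketch that only works out which instance indices go into the training and test sets for a given fold. It assumes the fold assignments are available as an int array (one fold id per instance); the class name KnownFolds, the array foldOf, and both method names are hypothetical, and it deliberately uses no Weka classes so it runs on its own.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: derives train/test index lists from predefined fold ids.
class KnownFolds {

    /** Indices of the instances assigned to the given fold (the test set). */
    public static List<Integer> testIndices(int[] foldOf, int fold) {
        List<Integer> out = new ArrayList<>();
        for (int i = 0; i < foldOf.length; i++) {
            if (foldOf[i] == fold) {
                out.add(i);
            }
        }
        return out;
    }

    /** Indices of all instances NOT assigned to the given fold (the training set). */
    public static List<Integer> trainIndices(int[] foldOf, int fold) {
        List<Integer> out = new ArrayList<>();
        for (int i = 0; i < foldOf.length; i++) {
            if (foldOf[i] != fold) {
                out.add(i);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Assumed input: foldOf[i] is the fold id already allocated to instance i.
        int[] foldOf = {0, 1, 2, 0, 1, 2};
        for (int fold = 0; fold < 3; fold++) {
            System.out.println("fold " + fold
                    + " train=" + trainIndices(foldOf, fold)
                    + " test=" + testIndices(foldOf, fold));
        }
    }
}
```

With these index lists, each fold's Weka sets could then be filled by creating an empty copy of the dataset header, e.g. new Instances(data, capacity), and adding data.instance(i) for every index i in the list.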

The source code of Instances.trainCV may be a good example of how to build Instances sets: http://grepcode.com/file/repo1.maven.org/maven2/nz.ac.waikato.cms.weka/weka-stable/3.6.7/weka/core/Instances.java

The key lines are calls to copyInstances.

copyInstances(0, train, first);
copyInstances(first + numInstForFold, train,
      numInstances() - first - numInstForFold);
return train;

It may also be possible to simply create an empty Instances object using new Instances(java.lang.String name, java.util.ArrayList<Attribute> attInfo, int capacity) and populate it using Instances.get on the superset of all instances.

Another way would be to use a filter, as in the following excerpt from when I was training multiple classifiers with the same splits (MATLAB calling the Weka Java classes):

filterRand = Randomize();
filterRange = RemoveRange();

%if the filterInst parameter is active, take a subsample of training
if doFilterTrain
    rangeStr = sprintf('%g-%g', 1, learnParams.trainSizeMax );
    filterRange.setInstancesIndices( rangeStr );
    filterRange.setInvertSelection( 1 );
end

if doFilterTrain
    filterRand.setInputFormat( instTrain );
    filterRange.setInputFormat( instTrain );

    instTrainSub = Filter.useFilter(instTrain, filterRand);
    instTrainSub = Filter.useFilter(instTrainSub, filterRange);
end
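The same RemoveRange idea could be pointed at a predefined fold by passing that fold's instance indices as a range list instead of a single 1-N range. The sketch below only builds such a string; it assumes RemoveRange accepts 1-based, comma-separated ranges such as "1-3,7", and the class RangeString and method toRangeList are hypothetical names, not part of Weka.

```java
import java.util.List;

// Hypothetical helper: turns sorted 0-based indices into a 1-based range list.
class RangeString {

    /** Converts sorted 0-based indices, e.g. [0, 1, 2, 6], to "1-3,7". */
    public static String toRangeList(List<Integer> sortedIndices) {
        StringBuilder sb = new StringBuilder();
        int n = sortedIndices.size();
        int i = 0;
        while (i < n) {
            int start = sortedIndices.get(i);
            int end = start;
            // Extend the current run while the next index is consecutive.
            while (i + 1 < n && sortedIndices.get(i + 1) == end + 1) {
                end = sortedIndices.get(i + 1);
                i++;
            }
            if (sb.length() > 0) {
                sb.append(',');
            }
            sb.append(start + 1); // shift to 1-based numbering
            if (end > start) {
                sb.append('-').append(end + 1);
            }
            i++;
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toRangeList(java.util.Arrays.asList(0, 1, 2, 6, 8, 9)));
    }
}
```

Under that assumption, passing the string for one fold to setInstancesIndices with invert selection on would keep only that fold (the test set), and with invert selection off would remove it (the training set).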

Good luck!