2
votes

I have prepared two different .arff files from two different datasets, one for testing and the other for training. Each file has the same number of instances but different features, so the dimensionality of the feature vector differs between the files. When I ran cross-validation on each of these files separately, both worked perfectly. This shows the .arff files are properly prepared and don't have any errors.

Now, if I use the train file (which has lower dimensionality than the test file) for evaluation, I get the following error:

Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 5986
at weka.classifiers.bayes.NaiveBayesMultinomial.probOfDocGivenClass(NaiveBayesMultinomial.java:295)
at weka.classifiers.bayes.NaiveBayesMultinomial.distributionForInstance(NaiveBayesMultinomial.java:254)
at weka.classifiers.Evaluation.evaluationForSingleInstance(Evaluation.java:1657)
at weka.classifiers.Evaluation.evaluateModelOnceAndRecordPrediction(Evaluation.java:1694)
at weka.classifiers.Evaluation.evaluateModel(Evaluation.java:1574)
at TrainCrossValidateARFF.main(TrainCrossValidateARFF.java:44)
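For context on what the index 5986 in the trace means: a multinomial naive Bayes model stores one probability entry per training attribute, and at prediction time it looks those entries up by the attribute index of each test instance. If the test header has more attributes than the training header, that lookup runs past the end of the array. A plain-Java sketch of this failure mode (no Weka dependency; all names here are illustrative, not Weka's actual internals):

```java
public class HeaderMismatchDemo {
    // "Model" built from a 3-attribute train header: one weight per attribute.
    static double score(double[] trainedWeights, double[] testInstance) {
        double s = 0;
        for (int i = 0; i < testInstance.length; i++) {
            // Throws ArrayIndexOutOfBoundsException when the test instance
            // has more attributes than the model was trained with.
            s += trainedWeights[i] * testInstance[i];
        }
        return s;
    }

    public static void main(String[] args) {
        double[] weights = {0.2, 0.5, 0.3};   // trained on 3 attributes
        double[] testInstance = {1, 0, 2, 1}; // test file declares 4 attributes
        try {
            score(weights, testInstance);
        } catch (ArrayIndexOutOfBoundsException e) {
            System.out.println("mismatch: " + e.getMessage());
        }
    }
}
```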

Does the test file in Weka require the same number of features as the train file, or fewer? Here is the code for evaluation:

import java.text.DecimalFormat;

import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayesMultinomial;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class TrainCrossValidateARFF {
    private static DecimalFormat df = new DecimalFormat("#.##");

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.out.println("USAGE: TrainCrossValidateARFF <train_arff_file> <test_arff_file>");
            System.exit(-1);
        }
        String trainArffFilePath = args[0];
        DataSource ds = new DataSource(trainArffFilePath);
        Instances train = ds.getDataSet();
        train.setClassIndex(train.numAttributes() - 1);

        String testArffFilePath = args[1];
        DataSource ds1 = new DataSource(testArffFilePath);
        Instances test = ds1.getDataSet();
        // setting class attribute
        test.setClassIndex(test.numAttributes() - 1);

        System.out.println("-----------" + trainArffFilePath + "--------------");
        System.out.println("-----------" + testArffFilePath + "--------------");
        NaiveBayesMultinomial naiveBayes = new NaiveBayesMultinomial();
        naiveBayes.buildClassifier(train);

        Evaluation eval = new Evaluation(train);
        eval.evaluateModel(naiveBayes, test);
        System.out.println(eval.toSummaryString("\nResults\n======\n", false));
    }
}

3 Answers

5
votes

Does the test file in Weka require the same number of features as the train file, or fewer?

The same number of features is necessary. You may also need to insert ? for the class attribute.

According to Weka architect Mark Hall:

To be compatible, the header information of the two sets of instances needs to be the same - same number of attributes, with the same names in the same order. Furthermore, any nominal attributes must have the same values declared in the same order in both sets of instances. For unknown class values in your test set just set the value of each to missing - i.e "?".
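Weka can perform this header check for you: Instances.equalHeaders(Instances) reports whether two headers are compatible (newer versions also provide equalHeadersMsg, which describes the mismatch). The check boils down to comparing attribute declarations position by position, roughly like this self-contained sketch (attribute declarations are simplified to strings here purely for illustration):

```java
import java.util.Arrays;
import java.util.List;

public class HeaderCheck {
    // Returns null if the headers are compatible,
    // otherwise a description of the first mismatch.
    static String equalHeaders(List<String> trainAttrs, List<String> testAttrs) {
        if (trainAttrs.size() != testAttrs.size()) {
            return "attribute count differs: " + trainAttrs.size()
                    + " vs " + testAttrs.size();
        }
        for (int i = 0; i < trainAttrs.size(); i++) {
            if (!trainAttrs.get(i).equals(testAttrs.get(i))) {
                return "attribute " + i + " differs: "
                        + trainAttrs.get(i) + " vs " + testAttrs.get(i);
            }
        }
        return null; // same count, same names/types, same order
    }

    public static void main(String[] args) {
        List<String> train = Arrays.asList("word_a numeric", "word_b numeric",
                                           "class {pos,neg}");
        List<String> test  = Arrays.asList("word_a numeric", "word_b numeric",
                                           "word_c numeric", "class {pos,neg}");
        System.out.println(equalHeaders(train, test)); // reports the count mismatch
    }
}
```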

1
vote

According to Weka's wiki, the number of features needs to be the same for both the training and test sets. The type of each feature (e.g., nominal, numeric) also needs to match.

Also, I assume that you didn't apply any Weka filters to either of your datasets. Datasets often become incompatible if you apply a filter separately to each one (even if it is the same filter).
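The usual fix is batch filtering: derive the filter from the training data only (call setInputFormat with the training set once), then apply that same initialized filter to both sets with Filter.useFilter, so both end up with the training set's attribute dictionary. The underlying principle, sketched without Weka using a toy word-count vectorizer (all names here are illustrative):

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class BatchFilterSketch {
    // "Fit": build the attribute dictionary from the training documents only.
    static Map<String, Integer> buildVocab(List<String> trainDocs) {
        Map<String, Integer> vocab = new LinkedHashMap<>();
        for (String doc : trainDocs)
            for (String w : doc.split("\\s+"))
                vocab.putIfAbsent(w, vocab.size());
        return vocab;
    }

    // "Transform": vectorize any document against the *training* dictionary,
    // so train and test vectors always have the same dimensionality.
    static int[] vectorize(String doc, Map<String, Integer> vocab) {
        int[] v = new int[vocab.size()];
        for (String w : doc.split("\\s+")) {
            Integer idx = vocab.get(w); // unseen test words are dropped, not appended
            if (idx != null) v[idx]++;
        }
        return v;
    }

    public static void main(String[] args) {
        Map<String, Integer> vocab = buildVocab(Arrays.asList("spam offer", "ham reply"));
        int[] test = vectorize("spam spam unseen", vocab);
        // The test vector has the training vocabulary's length, not its own.
        System.out.println(Arrays.toString(test));
    }
}
```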

0
votes

How do I divide a dataset into training and test sets?

You can use the RemovePercentage filter (package weka.filters.unsupervised.instance).

In the Explorer just do the following:

Training set:

- Load the full dataset
- Select the RemovePercentage filter in the preprocess panel
- Set the correct percentage for the split
- Apply the filter
- Save the generated data as a new file

Test set:

- Load the full dataset (or just use undo to revert the changes to the dataset)
- Select the RemovePercentage filter if not yet selected
- Set the invertSelection property to true
- Apply the filter
- Save the generated data as a new file
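The same split can also be done programmatically. At its core, RemovePercentage just cuts the instance list at a percentage boundary, with invertSelection deciding which side you keep. A self-contained plain-Java sketch of that idea (with real Weka data you would configure RemovePercentage and run it through Filter.useFilter instead):

```java
import java.util.Arrays;
import java.util.List;

public class PercentageSplit {
    // Roughly mirrors RemovePercentage: with invert=false the first `percent`%
    // of instances is removed (what remains becomes the training set); with
    // invert=true only that first slice is kept (the test set).
    static <T> List<T> split(List<T> data, double percent, boolean invert) {
        int cut = (int) Math.round(data.size() * percent / 100.0);
        return invert ? data.subList(0, cut) : data.subList(cut, data.size());
    }

    public static void main(String[] args) {
        List<Integer> all = Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8, 9, 10);
        List<Integer> train = split(all, 30, false); // remaining 70%
        List<Integer> test  = split(all, 30, true);  // removed 30%
        System.out.println(train.size() + " train / " + test.size() + " test");
    }
}
```

Running both calls with the same percentage gives two disjoint sets that together cover the full dataset, which is exactly what the two Explorer passes above produce.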