WEKA - Classifying New Data from Java - IDF Transform

Question

We are trying to use a WEKA classifier from inside a Java program. So far everything works well. However, when building the classifier from the training set in the Weka GUI, we used the StringToWordVector IDF transform to help improve classification accuracy.

From within Java, how do I calculate the IDF transform for each token value in a new instance before passing the instance to the classifier?

The basic code looks like this:

Instances ins = vectorize(msg);
Instances unlabeled = new Instances(train,1);
Instance inst = new Instance(unlabeled.numAttributes());

String tmp = "";

for(int i=0; i < ins.numAttributes(); i++) {
    tmp = ins.attribute(i).name();
    if(unlabeled.attribute(tmp)!=null)
      inst.setValue(unlabeled.attribute(tmp), 1.0); //TODO: Need to figure out the IDF transformed value to put here NOT 1!!
}

unlabeled.add(inst);

unlabeled.setClassIndex(classIdx);

.....cl.distributionForInstance(unlabeled.instance(i));

So how do I go about coding this so that I put the correct value in the new instance I want to classify?

Just to be clear, the line inst.setValue(unlabeled.attribute(tmp), 1.0); needs to be changed so that it sets the IDF-transformed number instead of 1.0.

iinception · Accepted Answer · 2011-09-01T15:59:53

You need to use FilteredClassifier for this purpose. The code snippet is:


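    import weka.classifiers.functions.SMO;
    import weka.classifiers.meta.FilteredClassifier;
    import weka.filters.unsupervised.attribute.StringToWordVector;

    StringToWordVector strWVector = new StringToWordVector();
    strWVector.setIDFTransform(true);   // enable the same IDF option that was used in the GUI
    FilteredClassifier fcls = new FilteredClassifier();
    fcls.setFilter(strWVector);
    fcls.setClassifier(new SMO());
    fcls.buildClassifier(yourdata);     // yourdata: training Instances that still hold the raw string attribute
    // rest of your code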

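This is much easier because you can pass all your instances at once and FilteredClassifier takes care of the other details: it builds the filter from the training data and applies the same transformation to every instance you later pass in for classification. The code is not tested, but it should get you started.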
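
For reference, the transform that WEKA documents for the IDF option is fij*log(num of Docs/num of Docs with word i), where fij is the frequency of word i in document j. The following is only a minimal plain-Java sketch of that formula, not WEKA's own code; the class IdfSketch and its method names are hypothetical, and it assumes documents are already tokenized and that the natural logarithm is used.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

// Hypothetical helper (not part of WEKA): illustrates tf * log(N / df).
final class IdfSketch {

    // Counts, for every token, how many training documents contain it.
    static Map<String, Integer> documentFrequencies(List<List<String>> trainingDocs) {
        Map<String, Integer> df = new HashMap<>();
        for (List<String> doc : trainingDocs) {
            // A set ensures each token is counted once per document.
            for (String token : new HashSet<>(doc)) {
                df.merge(token, 1, Integer::sum);
            }
        }
        return df;
    }

    // Weights the tokens of a new document with statistics from the training set.
    // Tokens that never occurred in training are skipped, mirroring the
    // attribute(tmp) != null check in the question.
    static Map<String, Double> idfWeights(List<String> newDoc,
                                          Map<String, Integer> df,
                                          int numTrainingDocs) {
        Map<String, Double> termCounts = new HashMap<>();
        for (String token : newDoc) {
            if (df.containsKey(token)) {
                termCounts.merge(token, 1.0, Double::sum);
            }
        }
        Map<String, Double> weighted = new HashMap<>();
        for (Map.Entry<String, Double> e : termCounts.entrySet()) {
            double idf = Math.log((double) numTrainingDocs / df.get(e.getKey()));
            weighted.put(e.getKey(), e.getValue() * idf);
        }
        return weighted;
    }
}
```

The key point the sketch makes is that the document counts come from the training set, not from the new message, which is why the filter has to be initialized on the training data. Letting FilteredClassifier apply the filter avoids having to reproduce these numbers by hand.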

Edit: You can also do it the following way. This code snippet is from the WEKA tutorial; see http://weka.wikispaces.com/Use+WEKA+in+your+Java+code#Filter-Filtering%20on-the-fly (Batch Mode) for details.


Instances train = ...   // from somewhere
Instances test = ...    // from somewhere
Standardize filter = new Standardize();
filter.setInputFormat(train);  // initializing the filter once with training set
Instances newTrain = Filter.useFilter(train, filter);  // configures the Filter based on train instances and returns filtered instances
Instances newTest = Filter.useFilter(test, filter);    // create new test set

HTH
