Training and Testing data structure : Mallet Classifier

Question

I am trying to use Mallet's Naive Bayes classifier API. I have structured the training and test sets as follows:

Training : [ID] [Label] [Data]
Testing: [ID] [ ] [Data]

Below is the code which I have used:

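    import java.io.FileNotFoundException;
    import java.io.FileReader;
    import java.util.ArrayList;

    import cc.mallet.classify.Classifier;
    import cc.mallet.classify.ClassifierTrainer;
    import cc.mallet.classify.NaiveBayesTrainer;
    import cc.mallet.pipe.CharSequence2TokenSequence;
    import cc.mallet.pipe.FeatureSequence2FeatureVector;
    import cc.mallet.pipe.Pipe;
    import cc.mallet.pipe.SerialPipes;
    import cc.mallet.pipe.Target2Label;
    import cc.mallet.pipe.TokenSequence2FeatureSequence;
    import cc.mallet.pipe.iterator.CsvIterator;
    import cc.mallet.types.Instance;
    import cc.mallet.types.InstanceList;
    import cc.mallet.types.Label;
    import cc.mallet.types.Labeling;

    // Class name is illustrative; the original post did not show it.
    public class MalletClassifierExample {

    public static void main(String[] args) throws FileNotFoundException {
        classify();
        System.out.println("Finished");
    }
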
    public static void classify() throws FileNotFoundException{

        //prepare instance transformation pipeline
        ArrayList<Pipe> pipes = new ArrayList<Pipe>();
        pipes.add(new Target2Label());
        pipes.add(new CharSequence2TokenSequence());
        pipes.add(new TokenSequence2FeatureSequence());
        pipes.add(new FeatureSequence2FeatureVector());
        SerialPipes pipe = new SerialPipes(pipes);

        //prepare training instances
        InstanceList trainingInstanceList = new InstanceList(pipe);
        trainingInstanceList.addThruPipe(new CsvIterator(new FileReader("resources/training.csv"),
                "(\\w+)\\s+(\\w+)\\s+(.*)", 3, 2, 1));  // capture groups: data = 3, label = 2, name = 1

        //prepare test instances
        InstanceList testingInstanceList = new InstanceList(pipe);        
        testingInstanceList.addThruPipe(new CsvIterator(new FileReader("resources/testing.csv"), "(\\w+)\\s+(\\w+)\\s+(.*)",  3, 2, 1));

        ClassifierTrainer trainer = new NaiveBayesTrainer();
        Classifier classifier = trainer.train(trainingInstanceList);


        for (Instance testInstance : testingInstanceList) {
            // classify() returns a Classification; the Labeling is obtained from it
            Labeling labeling = classifier.classify(testInstance).getLabeling();
            Label l = labeling.getBestLabel();
            System.out.println(testInstance.getName() + " = " + l);
        }

        System.out.println("Accuracy: " + classifier.getAccuracy(testingInstanceList));

   }
}

It throws an error saying that line 'x' does not match the regex. I understand it's a problem with importing the data, but what is the actual format for representing the training and test sets when using Mallet?
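One way to see which lines trip the pattern, before handing the file to Mallet, is to run it over the lines with plain `java.util.regex`. The sketch below is not Mallet code: the class name, method names, and sample lines are made up for illustration, and it assumes the iterator searches each line for the pattern (for these sample lines a full-line match would give the same result). It also shows that a line with a missing label may not fail at all: the first word of the data can be silently read as the label.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LineRegexCheck {
    // Same pattern as in the question: (name) (label) (data), whitespace-separated.
    static final Pattern LINE = Pattern.compile("(\\w+)\\s+(\\w+)\\s+(.*)");

    /** Returns the 1-based numbers of the lines in which the pattern cannot be found. */
    public static List<Integer> nonMatchingLines(List<String> lines) {
        List<Integer> bad = new ArrayList<>();
        for (int i = 0; i < lines.size(); i++) {
            if (!LINE.matcher(lines.get(i)).find()) {
                bad.add(i + 1);
            }
        }
        return bad;
    }

    /** Returns what the pattern treats as the label (group 2), or null if there is no match. */
    public static String labelOf(String line) {
        Matcher m = LINE.matcher(line);
        return m.find() ? m.group(2) : null;
    }

    public static void main(String[] args) {
        // Hypothetical sample lines, not taken from the asker's files.
        List<String> lines = Arrays.asList(
                "id1 spam buy cheap pills", // name, label, data: matches
                "id2 hello",                // no label, one-word data: no match
                "id3 hello world");         // no label: matches, "hello" is read as the label
        System.out.println(nonMatchingLines(lines)); // prints [2]
        System.out.println(labelOf(lines.get(2)));   // prints hello
    }
}
```

Running a check like this over `training.csv` and `testing.csv` would point at the offending line numbers. Note also that the pattern expects whitespace between fields, so a comma-separated file would need a different pattern.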

David Mimno · Accepted Answer · 2017-06-05T17:58:09

It's important to distinguish "Testing" from "Production". Testing implies that you actually know the label and just want to see whether the classifier can guess it correctly. If you don't have a label and want to predict one, you can classify directly from text data. Here's the documentation from the Mallet website:

To apply a saved classifier to new unlabeled data, use Csv2Classify (for one-instance-per-line data) or Text2Classify (for one-instance-per-file data).

bin/mallet classify-file --input data --output - --classifier classifier
bin/mallet classify-dir --input datadir --output - --classifier classifier

Using the above commands, classifications are written to standard output. Note that the input for these commands is a raw text file, not an imported Mallet file. This command is designed to be used in "production" mode, where labels are not available.
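To make the testing/production distinction concrete: an accuracy figure, like the one the question asks for with `classifier.getAccuracy(...)`, is the fraction of predictions that agree with known labels, so it cannot be computed for data that has no labels. A minimal sketch in plain Java, independent of Mallet; the class name, method, and sample labels are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

public class AccuracySketch {
    /** Fraction of positions where the predicted label equals the known (gold) label. */
    public static double accuracy(List<String> gold, List<String> predicted) {
        if (gold.isEmpty() || gold.size() != predicted.size()) {
            throw new IllegalArgumentException("need equally sized, non-empty label lists");
        }
        int correct = 0;
        for (int i = 0; i < gold.size(); i++) {
            if (gold.get(i).equals(predicted.get(i))) {
                correct++;
            }
        }
        return (double) correct / gold.size();
    }

    public static void main(String[] args) {
        // Testing: gold labels are known, so accuracy is defined.
        List<String> gold = Arrays.asList("spam", "ham", "spam", "ham");
        List<String> predicted = Arrays.asList("spam", "ham", "ham", "ham");
        System.out.println(accuracy(gold, predicted)); // prints 0.75
        // Production: only predictions exist, so there is nothing to compare against.
    }
}
```

So a test file should carry real labels in the same format as the training file, while unlabeled data is better passed to the classify commands shown above.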
