I am trying to use the Mallet Naive Bayes classifier API. I have modeled the training set and the test set as follows:

  • Training : [ID] [Label] [Data]
  • Testing: [ID] [ ] [Data]

Below is the code which I have used:

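    import java.io.FileNotFoundException;
    import java.io.FileReader;
    import java.util.ArrayList;

    import cc.mallet.classify.Classifier;
    import cc.mallet.classify.ClassifierTrainer;
    import cc.mallet.classify.NaiveBayesTrainer;
    import cc.mallet.pipe.CharSequence2TokenSequence;
    import cc.mallet.pipe.FeatureSequence2FeatureVector;
    import cc.mallet.pipe.Pipe;
    import cc.mallet.pipe.SerialPipes;
    import cc.mallet.pipe.Target2Label;
    import cc.mallet.pipe.TokenSequence2FeatureSequence;
    import cc.mallet.pipe.iterator.CsvIterator;
    import cc.mallet.types.Instance;
    import cc.mallet.types.InstanceList;
    import cc.mallet.types.Label;
    import cc.mallet.types.Labeling;

    // Class name is a placeholder; the original snippet omitted the declaration.
    public class MalletClassify {

    public static void main(String[] args) throws FileNotFoundException {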
        classify();
        System.out.println("Finished");
    }



    public static void classify() throws FileNotFoundException{

        //prepare instance transformation pipeline
        ArrayList<Pipe> pipes = new ArrayList<Pipe>();
        pipes.add(new Target2Label());
        pipes.add(new CharSequence2TokenSequence());
        pipes.add(new TokenSequence2FeatureSequence());
        pipes.add(new FeatureSequence2FeatureVector());
        SerialPipes pipe = new SerialPipes(pipes);

        //prepare training instances
        InstanceList trainingInstanceList = new InstanceList(pipe);
        // regex group indices: data = 3, label = 2, name = 1
        trainingInstanceList.addThruPipe(new CsvIterator(new FileReader("resources/training.csv"), "(\\w+)\\s+(\\w+)\\s+(.*)", 3, 2, 1));

        //prepare test instances
        InstanceList testingInstanceList = new InstanceList(pipe);        
        testingInstanceList.addThruPipe(new CsvIterator(new FileReader("resources/testing.csv"), "(\\w+)\\s+(\\w+)\\s+(.*)",  3, 2, 1));

        ClassifierTrainer trainer = new NaiveBayesTrainer();
        Classifier classifier = trainer.train(trainingInstanceList);


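        for (Instance testInstance : testingInstanceList) {
            // classify() returns a Classification; the Labeling is read from it
            Labeling labeling = classifier.classify(testInstance).getLabeling();
            Label l = labeling.getBestLabel();
            // print the instance name (the ID column) next to the predicted label
            System.out.println(testInstance.getName() + " = " + l);
        }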

        System.out.println("Accuracy: " + classifier.getAccuracy(testingInstanceList));

    }
    }

It throws an error saying that line 'x' does not match the regex. I understand it's a problem with importing the data, but what is the actual format for representing the training and testing sets when using Mallet?

1 Answer

It's important to distinguish "Testing" from "Production". Testing implies that you actually know the label; you just want to see whether the classifier can guess it correctly. If you don't have a label and want to predict one, you can classify directly from text data. Here's the documentation from the Mallet website:

To apply a saved classifier to new unlabeled data, use Csv2Classify (for one-instance-per-line data) or Text2Classify (for one-instance-per-file data).

    bin/mallet classify-file --input data --output - --classifier classifier
    bin/mallet classify-dir --input datadir --output - --classifier classifier

Using the above commands, classifications are written to standard output. Note that the input for these commands is a raw text file, not an imported Mallet file. This command is designed to be used in "production" mode, where labels are not available.
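
As for the "does not match regex" error itself, it may help to run the question's pattern against a few lines outside Mallet. The following is a minimal sketch using only java.util.regex. It assumes CsvIterator searches each line with find(); the class name LineFormatCheck and the sample lines are made up for illustration and are not taken from the question's files.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LineFormatCheck {
    // Same pattern as in the question: group 1 = name, group 2 = label, group 3 = data
    static final Pattern LINE = Pattern.compile("(\\w+)\\s+(\\w+)\\s+(.*)");

    // Returns {name, label, data}, or null when the line does not match.
    static String[] parse(String line) {
        Matcher m = LINE.matcher(line);
        // Assumption: the line is searched with find() rather than matches().
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }

    public static void main(String[] args) {
        String[] lines = {
            "42 pos great movie",  // labeled line
            "43  great movie",     // label column left blank
            "44  great"            // blank label, single-word data
        };
        for (String line : lines) {
            String[] f = parse(line);
            System.out.println(f == null
                ? "no match: " + line
                : "name=" + f[0] + " label=" + f[1] + " data=" + f[2]);
        }
    }
}
```

With this pattern, a blank label column is not read as "no label": the first data word is captured as the label instead, and a line with only one data word does not match at all. That would be consistent with the error in the question if the testing file leaves the label column empty.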