4
votes

I want to classify data into different classes based on its content. I did this with a naive Bayes classifier, which outputs the best category for each item. Now I want news items that do not belong to any category in the training set to be classified into an "others" class. I can't manually assign every item outside the training data to a class, because there is a vast number of other categories. Is there any way to classify this other data?

import java.io.File;
import java.io.IOException;

import com.aliasi.classify.Classification;
import com.aliasi.classify.Classified;
import com.aliasi.classify.DynamicLMClassifier;
import com.aliasi.lm.NGramProcessLM;
import com.aliasi.util.Files;

// The wrapper class name is illustrative; the original post showed only the members.
public class ClassifyNews {
    private static File TRAINING_DIR = new File("4news-train");
    private static File TESTING_DIR = new File("4news-test");
    private static String[] CATEGORIES = { "c1", "c2", "c3", "others" };

    private static int NGRAM_SIZE = 6;

    public static void main(String[] args) throws ClassNotFoundException, IOException {
        DynamicLMClassifier<NGramProcessLM> classifier = DynamicLMClassifier.createNGramProcess(CATEGORIES, NGRAM_SIZE);
        // Train one character n-gram language model per category, including "others".
        for (int i = 0; i < CATEGORIES.length; ++i) {
            File classDir = new File(TRAINING_DIR, CATEGORIES[i]);
            if (!classDir.isDirectory()) {
                String msg = "Could not find training directory=" + classDir + "\nTraining directory not found";
                System.out.println(msg); // in case exception gets lost in shell
                throw new IllegalArgumentException(msg);
            }

            // Every file in the category directory is one training document.
            String[] trainingFiles = classDir.list();
            for (int j = 0; j < trainingFiles.length; ++j) {
                File file = new File(classDir, trainingFiles[j]);
                String text = Files.readFromFile(file, "ISO-8859-1");
                System.out.println("Training on " + CATEGORIES[i] + "/" + trainingFiles[j]);
                Classification classification = new Classification(CATEGORIES[i]);
                Classified<CharSequence> classified = new Classified<CharSequence>(text, classification);
                classifier.handle(classified);
            }
        }
    }
}
2
Not sure what you are asking. Your training set is composed of C1, C2, C3 categories only, and you want to classify into 4 categories: C1, C2, C3, others? - amit
I would strongly recommend getting a pencil and making sure you understand what calculations need to be done. The challenge you are facing has nothing to do with the code but with the calculations, so your question might be better suited to stats.stackexchange.com. See the notes below if you need any help with the calculations: inf.ed.ac.uk/teaching/courses/inf2b/lectureSchedule.html - matcheek
@matcheek I believe the question is in fact about the LingPipe library, not about naive Bayes itself. - Jakub Kotowski
@matcheek This is not only about the LingPipe library but also about naive Bayes. I want to classify all data other than that belonging to c1, c2, c3 into the category "others". I am just asking how I can implement it. - lulu
I have built an intermediate model, which avoids frequent retraining, and I pass the test data to that model. This code is what I tried first. I train on the contents of different folders: in c1 I put data about c1 and train on it. Likewise I have to train "others" too, so I have to build training data for the "others" folder as well. That means collecting a large amount of data unrelated to c1, c2 and c3 for training. There should be some limit, right? - lulu

2 Answers

0
votes

Just serialize the object: write the intermediate object to a file, and that file will be your model.

Then for testing you just need to pass the data into the model; there is no need to train each time. It will be much easier for you.
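
As a rough illustration of the "write the model to a file, read it back later" idea, here is a minimal sketch using only standard Java serialization. `ModelStore` and `ToyModel` are hypothetical names invented for this example, and `ToyModel` is just a stand-in for a trained classifier. LingPipe's `AbstractExternalizable` utilities are probably the library-specific route for the real classifier; check its Javadoc before relying on that.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.HashMap;
import java.util.Map;

public class ModelStore {
    /** Hypothetical stand-in for a trained classifier; only its serializability matters here. */
    static class ToyModel implements Serializable {
        private static final long serialVersionUID = 1L;
        final Map<String, Integer> docsPerCategory = new HashMap<>();

        /** "Training" here just counts documents per category. */
        void train(String category) {
            docsPerCategory.merge(category, 1, Integer::sum);
        }
    }

    /** Writes the trained model to a file once, after training. */
    static void save(Serializable model, File file) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(file))) {
            out.writeObject(model);
        }
    }

    /** Reads a previously saved model back, so no retraining is needed. */
    static Object load(File file) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new FileInputStream(file))) {
            return in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        ToyModel model = new ToyModel();
        model.train("c1");
        model.train("c1");
        model.train("c2");

        File file = File.createTempFile("model", ".ser");
        file.deleteOnExit();
        save(model, file);

        ToyModel restored = (ToyModel) load(file);
        System.out.println(restored.docsPerCategory);
    }
}
```

In a real setup the training program would call `save` once, and the testing program would only call `load`.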

1
votes

Naive Bayes gives you the "confidence" in each classification, as it computes

P(y|x) ~ P(y)P(x|y)

Up to normalization by P(x), this is the probability that x belongs to class y. You can simply apply a cut-off to this value and say that

cl(x) = "other" iff max_{over y}(P(y|x)) < T

where T can be, for example, the minimum confidence on the training set:

T = min_{over x and y in Training set}( P(y|x) )
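
The cut-off rule above could be sketched roughly as follows. This is a plain-Java illustration, not LingPipe code: `OtherThreshold` and its methods are hypothetical names, the scores in `main` are made-up numbers, and it assumes you can obtain the joint scores P(y)P(x|y) for every category from your classifier. T is taken as the smallest posterior of the true label over the training examples.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class OtherThreshold {
    /** Normalizes joint scores P(y)P(x|y) so they sum to 1, giving P(y|x). */
    static Map<String, Double> posterior(Map<String, Double> joint) {
        double total = 0.0;
        for (double v : joint.values()) {
            total += v;
        }
        Map<String, Double> post = new LinkedHashMap<>();
        for (Map.Entry<String, Double> e : joint.entrySet()) {
            post.put(e.getKey(), e.getValue() / total);
        }
        return post;
    }

    /** Returns the most probable category, or "other" if its posterior is below the threshold. */
    static String classify(Map<String, Double> post, double threshold) {
        String best = null;
        double bestP = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, Double> e : post.entrySet()) {
            if (e.getValue() > bestP) {
                bestP = e.getValue();
                best = e.getKey();
            }
        }
        return bestP < threshold ? "other" : best;
    }

    /** T = minimum, over training examples, of the posterior assigned to the true label. */
    static double minTrainingConfidence(List<Map<String, Double>> posteriors, List<String> labels) {
        double t = 1.0;
        for (int i = 0; i < posteriors.size(); i++) {
            t = Math.min(t, posteriors.get(i).get(labels.get(i)));
        }
        return t;
    }

    public static void main(String[] args) {
        // Illustrative posteriors for two training documents and their true labels.
        List<Map<String, Double>> trainPosteriors = List.of(
                Map.of("c1", 0.90, "c2", 0.05, "c3", 0.05),
                Map.of("c1", 0.20, "c2", 0.70, "c3", 0.10));
        List<String> trainLabels = List.of("c1", "c2");
        double t = minTrainingConfidence(trainPosteriors, trainLabels);

        // Illustrative joint scores for two unseen documents.
        Map<String, Double> unsure = posterior(Map.of("c1", 0.02, "c2", 0.02, "c3", 0.01));
        Map<String, Double> confident = posterior(Map.of("c1", 0.08, "c2", 0.01, "c3", 0.01));

        System.out.println(classify(unsure, t));    // no category reaches T
        System.out.println(classify(confident, t)); // c1 clears T
    }
}
```

Naive Bayes posteriors tend to be pushed towards 0 or 1 when many features are multiplied together, so T may need tuning on held-out data rather than being taken straight from the training set.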