I am able to classify the sentiment of my data with Mahout, but I am stuck at the confusion matrix.

I am using the naive Bayes implementation in Mahout 0.7 to classify the sentiment of tweets. I use trainnb to train the classifier and testnb to classify tweets as 'positive', 'negative' or 'neutral'.

Sample positive training set

      'positive','i love my i phone'
      'positive','it's pleasure to have i phone'

I have prepared negative and neutral training samples in the same way; it is a huge data set.

The test tweets I provide do not include sentiment labels:

  'it is nice model'
  'simply fantastic ' 

I am able to run the Mahout classification algorithm, and it reports the classified instances as a confusion matrix.

As the next step, I need to find out which tweets show positive sentiment and which show negative. The expected output of classification is each text tagged with its sentiment:

      'negative','very bad btr life time'
      'positive','i phone has excellent design features'

Which Mahout algorithm do I need to use to get output in the above format, or is a custom implementation required?

To display the data this way, kindly suggest which of the algorithms Apache Mahout provides would be suitable for sentiment analysis of my Twitter data.

2 Answers

3
votes

In general, to classify a text you compute its Naive Bayes score for each class (positive and negative in your case) and then just choose the class with the greater score.
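To make the "score each class, keep the best" idea concrete, here is a minimal sketch in plain Java with no Mahout dependency. The class name, the helper methods and all probabilities are made up for illustration; a trained model would supply the real priors and word likelihoods.

```java
import java.util.HashMap;
import java.util.Map;

public class NaiveBayesArgmaxSketch {

    // Toy class priors (hypothetical numbers, not from a trained model).
    static Map<String, Double> toyPriors() {
        Map<String, Double> priors = new HashMap<>();
        priors.put("positive", Math.log(0.5));
        priors.put("negative", Math.log(0.5));
        return priors;
    }

    // Toy per-class word likelihoods (hypothetical numbers).
    static Map<String, Map<String, Double>> toyLikelihoods() {
        Map<String, Double> pos = new HashMap<>();
        pos.put("love", Math.log(0.3));
        pos.put("bad", Math.log(0.01));
        Map<String, Double> neg = new HashMap<>();
        neg.put("love", Math.log(0.02));
        neg.put("bad", Math.log(0.4));
        Map<String, Map<String, Double>> all = new HashMap<>();
        all.put("positive", pos);
        all.put("negative", neg);
        return all;
    }

    // Log-score of a document under one class:
    // log prior plus the sum of the log likelihoods of its words.
    static double logScore(String[] tokens, double logPrior,
                           Map<String, Double> logLikelihood, double unseenLogLikelihood) {
        double score = logPrior;
        for (String token : tokens) {
            score += logLikelihood.getOrDefault(token, unseenLogLikelihood);
        }
        return score;
    }

    // Scores the document against every class and returns the best label.
    static String classify(String[] tokens, Map<String, Double> logPriors,
                           Map<String, Map<String, Double>> logLikelihoods,
                           double unseenLogLikelihood) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, Double> e : logPriors.entrySet()) {
            double s = logScore(tokens, e.getValue(),
                    logLikelihoods.get(e.getKey()), unseenLogLikelihood);
            if (s > bestScore) {
                bestScore = s;
                best = e.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        String label = classify("i love it".split(" "),
                toyPriors(), toyLikelihoods(), Math.log(0.001));
        System.out.println(label);
    }
}
```

Working in log space avoids underflow when many small probabilities are multiplied; the comparison between classes is unchanged.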

This excerpt from the Mahout book has some examples. See Listing 2:

Parameters p = new Parameters();
p.set("basePath", modelDir.getCanonicalPath());
Datastore ds = new InMemoryBayesDatastore(p);
Algorithm a = new BayesAlgorithm();
ClassifierContext ctx = new ClassifierContext(a,ds);
ctx.initialize();

....

ClassifierResult result = ctx.classifyDocument(tokens, defaultCategory);

Here result should hold either "positive" or "negative" label.

1
votes

I am not sure I can help you in full, but I hope I can give you some entry points. In general, my advice would be to download Mahout's source code and see how the examples and the classes you need are implemented. This is not that easy, and you should be prepared for the fact that Mahout has no easy entry doors. But once you are through them, you will learn quickly.

First of all, it depends on the version of Mahout you are using. I am using 0.7 myself, so my explanation will be regarding 0.7.

public void classify(String modelLocation, RawEntry unclassifiedInstanceRaw) throws IOException {

    Configuration conf = new Configuration();

    NaiveBayesModel model = NaiveBayesModel.materialize(new Path(modelLocation), conf);
    AbstractNaiveBayesClassifier classifier = new StandardNaiveBayesClassifier(model);

    String unclassifiedInstanceFeatures = RawEntry.toNaiveBayesTrainingFormat(unclassifiedInstanceRaw);

    FeatureVectorEncoder vectorEncoder = new AdaptiveWordValueEncoder("features");
    vectorEncoder.setProbes(1); // my features vectors are tiny

    // split the feature string once; each token is one feature
    String[] features = unclassifiedInstanceFeatures.split(" ");

    Vector unclassifiedInstanceVector = new RandomAccessSparseVector(features.length);

    for (String feature : features) {
        vectorEncoder.addToVector(feature, unclassifiedInstanceVector);
    }

    Vector classificationResult = classifier.classifyFull(unclassifiedInstanceVector);

    System.out.println(classificationResult.asFormatString());

}

What happens here:

1) First, you load the model produced by trainnb. It was saved to the location you specified with the -o parameter when calling trainnb. The model is a .bin file.

2) A StandardNaiveBayesClassifier is created from your model.

3) RawEntry is my custom class, which is just a wrapper around the raw string of my data. toNaiveBayesTrainingFormat takes the string I want to classify, removes noise from it based on my needs and simply returns a string of features 'word1 word2 word3 word4'. So my unclassified raw string is converted into a format suitable for classification.

4) Now the string of features needs to be encoded as a Mahout Vector, because the classifier only accepts Vector input.

5) Pass vector to classifier - magic.
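Since RawEntry is a custom class that is not shown, here is a hypothetical stand-in for the cleaning in step 3. The class name, the method name and the choice of what counts as noise (URLs, mentions, hashtags, punctuation) are assumptions for illustration only; adapt them to your own data.

```java
import java.util.Locale;

public class TweetCleanerSketch {

    // Hypothetical replacement for RawEntry.toNaiveBayesTrainingFormat:
    // turns a raw tweet into a space-separated string of features.
    static String toFeatureString(String rawTweet) {
        String s = rawTweet.toLowerCase(Locale.ROOT);
        s = s.replaceAll("https?://\\S+", " "); // drop URLs
        s = s.replaceAll("[@#]\\w+", " ");      // drop mentions and hashtags
        s = s.replaceAll("[^a-z0-9' ]", " ");   // drop remaining punctuation
        return s.trim().replaceAll("\\s+", " "); // collapse whitespace
    }

    public static void main(String[] args) {
        System.out.println(toFeatureString("Simply FANTASTIC!! http://t.co/abc @apple"));
    }
}
```

Whatever cleaning you choose, apply the same steps to the training data and to the tweets you classify, otherwise the features will not line up.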

This is the first part. Now, the classifier returns a Vector which contains a score for each class (each sentiment in your case). You want a specific output format. The most straightforward approach to implement (though I assume not the most efficient or stylish) would be the following:

1) Create a MapReduce job which goes through all the data you want to classify.

2) For each instance, call the classify method (don't forget to make a few changes so that you do not create a StandardNaiveBayesClassifier for every instance).

3) Once you have the classification result vector, you can output the data in whatever format you wish in your MapReduce job.

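4) A useful setting here is jC.set("mapreduce.textoutputformat.separator", " "); where jC is the JobConf. It lets you choose the separator for the output file of your MapReduce job; in your case this would be ",". The property name varies between Hadoop versions (Hadoop 1.x, for example, reads mapred.textoutputformat.separator), so check which one your TextOutputFormat uses.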
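Step 3 can be sketched without Mahout or Hadoop. The example below assumes the scores have been copied out of the result vector into a double[] and that the labels array is in the same order as the label index the model was trained with; both the class and method names are hypothetical.

```java
public class SentimentOutputSketch {

    // Index of the largest score; the classifier's best guess.
    static int argMax(double[] scores) {
        if (scores.length == 0) {
            throw new IllegalArgumentException("no scores");
        }
        int best = 0;
        for (int i = 1; i < scores.length; i++) {
            if (scores[i] > scores[best]) {
                best = i;
            }
        }
        return best;
    }

    // Formats one line as 'label','tweet', the layout asked for in the question.
    // Assumes labels[i] names the class whose score is scores[i].
    static String toOutputLine(double[] scores, String[] labels, String tweet) {
        if (scores.length != labels.length) {
            throw new IllegalArgumentException("scores and labels differ in length");
        }
        return "'" + labels[argMax(scores)] + "','" + tweet + "'";
    }

    public static void main(String[] args) {
        double[] scores = {-12.5, -9.1, -15.0}; // made-up scores
        String[] labels = {"negative", "positive", "neutral"};
        System.out.println(toOutputLine(scores, labels,
                "i phone has excellent design features"));
    }
}
```

In a mapper you would build the classifier once during setup and call something like toOutputLine for each tweet, writing the resulting line as the output value.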

Again, this all applies to Mahout 0.7. No guarantees it will work for you as is. It worked for me though.

In general, I have never worked with Mahout from the command line; for me, using Mahout from Java is the way to go.