I am working on sentiment analysis of tweets, using Mahout's naive Bayes classifier. I created a directory "data" containing three subdirectories named "positive", "negative" and "uncertain", and put 151 files (151 MB in total) in each of them. I then copied the data directory to HDFS. Below are the commands I ran to generate the model and label index from it.

bin/mahout seqdirectory -i ${WORK_DIR}/data -o ${WORK_DIR}/data-seq

bin/mahout seq2sparse -i ${WORK_DIR}/data-seq -o ${WORK_DIR}/data-vectors -lnorm -nv -wttfidf

bin/mahout split -i ${WORK_DIR}/data-vectors/tfidf-vectors --trainingOutput ${WORK_DIR}/data-train-vectors --testOutput ${WORK_DIR}/data-test-vectors --randomSelectionPct 40 --overwrite --sequenceFiles -xm sequential

bin/mahout trainnb -i ${WORK_DIR}/data-train-vectors -el -o ${WORK_DIR}/model -li ${WORK_DIR}/labelindex -ow $c

When I test on the same data with the "testnb" command below, I get this confusion matrix:

 bin/mahout testnb -i ${WORK_DIR}/data-train-vectors -m ${WORK_DIR}/model -l ${WORK_DIR}/labelindex -ow -o ${WORK_DIR}/data-testing $c

Confusion Matrix
-------------------------------------------------------
a           b        c       <--Classified as
151         0        0   |  151         a     = negative
0           151      0   |  151         b     = positive
0           0       151  |  151         c     = uncertain

Then I created another directory, "data2", in the same way and put a subset of the training data in its positive, negative and uncertain directories (30 files, 30 MB in total, in each). I then created vectors from it using the commands below:

bin/mahout seqdirectory -i ${WORK_DIR}/data2 -o ${WORK_DIR}/data2-seq

bin/mahout seq2sparse -i ${WORK_DIR}/data2-seq -o ${WORK_DIR}/data2-vectors -lnorm -nv -wttfidf

I then ran "testnb" with the model and labelindex created from the first data set, using the command below:

bin/mahout testnb -i ${WORK_DIR}/data2-vectors/tfidf-vectors/part-r-00000 -m ${WORK_DIR}/model -l ${WORK_DIR}/labelindex -ow -o ${WORK_DIR}/data2-testing $c

I get a confusion matrix like this:

Confusion Matrix
-------------------------------------------------------
a       b       c           <--Classified as
0      30       0       |  30       a     = negative
0      30       0       |  30       b     = positive
0      30       0       |  30       c     = uncertain
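
To put a number on the difference between the two runs, here is a small sketch in plain shell and awk (not part of Mahout) that computes overall accuracy from the rows of a confusion matrix. The function name `accuracy` and the input format are assumptions made for illustration: it expects only the numeric cells to the left of the `|`, one row per line.

```shell
# Hypothetical helper: overall accuracy from a square confusion matrix.
# Input on stdin: one row per line, whitespace-separated counts only.
accuracy() {
  awk '{
         for (i = 1; i <= NF; i++) total += $i   # every classified instance
         correct += $NR                          # diagonal cell of this row
       }
       END { if (total > 0) printf "%.4f\n", correct / total }'
}

first=$(printf '151 0 0\n0 151 0\n0 0 151\n' | accuracy)
second=$(printf '0 30 0\n0 30 0\n0 30 0\n' | accuracy)
echo "first run:  $first"
echo "second run: $second"
```

The second run comes out at one third, which is what you get when every document is assigned to a single class out of three equally sized classes.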

Can anyone tell me why this happens? Am I testing the model the correct way, or is this a bug in Mahout 0.7? If this is not the correct way, please suggest an alternative.

1 Answer

Can you try this:

bin/mahout testnb -i ${WORK_DIR}/data2-vectors/tfidf-vectors -m ${WORK_DIR}/model -l ${WORK_DIR}/labelindex -ow -o ${WORK_DIR}/data2-testing $c

(That is, remove the trailing "part-r-00000" so that the input is the tfidf-vectors directory.)
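
If the input path is held in a shell variable, the file name can be stripped with parameter expansion so that "testnb" is given the directory. This is only a minimal sketch: the default `WORK_DIR` value and the variable names `part_file` and `vector_dir` are placeholders made up for illustration, and the last line prints the command instead of running it.

```shell
WORK_DIR=${WORK_DIR:-/tmp/mahout-work}   # placeholder default; use your own WORK_DIR
part_file="${WORK_DIR}/data2-vectors/tfidf-vectors/part-r-00000"

# Drop the trailing "/part-r-..." component, leaving the directory path.
vector_dir="${part_file%/part-r-*}"

# Print the resulting command (the trailing $c from the original is omitted
# here because its value is not shown in the question).
echo "bin/mahout testnb -i ${vector_dir} -m ${WORK_DIR}/model -l ${WORK_DIR}/labelindex -ow -o ${WORK_DIR}/data2-testing"
```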