
I am currently working on sentiment analysis of Twitter data for a telecom company. I am loading the data into HDFS and using Mahout's Naive Bayes classifier to predict the sentiment as positive, negative, or neutral.

Here's what I am doing:

  1. I am providing training data to the machine (key: sentiment, value: text).

  2. Using the Mahout library, I compute tf-idf (term frequency / inverse document frequency) weights for the text to create feature vectors:

    mahout seq2sparse -i /user/root/new_model/dataseq --maxDFPercent 1000000 --minSupport 4 --maxNGramSize 2 -a org.apache.lucene.analysis.WhitespaceAnalyzer -o /user/root/new_model/predicted

  3. I split the data into a training set and a testing set.

  4. I pass the feature vectors to the Naive Bayes algorithm to build a model:

    mahout trainnb -i /user/root/new_model/train-vectors -el -li /user/root/new_model/labelindex -o /user/root/new_model/model -ow -c

  5. Using this model, I predict the sentiment of new data.
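Step 2's tf-idf weighting can be sketched in plain Python. This is not Mahout's implementation, just the idea behind the feature vectors it produces (the documents and weights here are illustrative):

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Compute tf-idf weights for whitespace-tokenized documents.

    weight(term, doc) = tf(term, doc) * log(N / df(term)),
    where df(term) is the number of documents containing the term.
    """
    n = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    # Document frequency: how many documents contain each term.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors

docs = ["great service great network", "bad network coverage", "great coverage"]
vecs = tfidf_vectors(docs)
```

Note how a term that appears in every document gets weight zero (log(N/N) = 0), which is why common filler words contribute little even before stop-word removal.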

This is a very simple implementation, and even with a good training set I am getting very low accuracy. So I was thinking of switching to logistic regression or SVM, because they tend to give better results for this kind of problem.

So my question is: how can I build my model and predict the sentiment of tweets using these two algorithms? What steps do I need to follow to achieve this?

Are you filtering the words using stop words? How low is your accuracy? Is your accuracy calculated over one single test set, or is it cross-validated? - usual me
No, I am not removing stop words. I have tested it on 1000 test samples. Accuracy is around 65%. - Deepesh Shetty
If you keep the stop words (i.e., noisy features) and do only one pass of train/test, the resulting accuracy might not be very meaningful. Before deciding whether to change the algorithm, I suggest making sure that 65% is an accurate estimate of the accuracy. For example, you could perform cross-validation (I don't know if this is possible with Mahout), or you could run your train/test procedure n times and compute the average accuracy (70% train / 30% test and 90% train / 10% test are common schemes). - usual me
@jfk916 We have tried removing the stop words too. It still doesn't increase accuracy much. - Deepesh Shetty
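The repeated train/test averaging suggested in the comments above can be sketched in plain Python. The majority-class baseline and the data here are illustrative placeholders, not part of the original setup:

```python
import random
from collections import Counter

def average_accuracy(data, train_fn, predict_fn, runs=10, train_frac=0.7, seed=0):
    """Average test accuracy over `runs` random train/test splits."""
    rng = random.Random(seed)
    accs = []
    for _ in range(runs):
        shuffled = data[:]
        rng.shuffle(shuffled)
        cut = int(len(shuffled) * train_frac)
        train, test = shuffled[:cut], shuffled[cut:]
        model = train_fn(train)
        correct = sum(predict_fn(model, x) == y for x, y in test)
        accs.append(correct / len(test))
    return sum(accs) / len(accs)

# Baseline "learner": always predict the majority class of the training split.
def train_majority(train):
    return Counter(label for _, label in train).most_common(1)[0][0]

def predict_majority(model, text):
    return model

data = [("tweet", "pos")] * 80 + [("tweet", "neg")] * 20
acc = average_accuracy(data, train_majority, predict_majority)
```

Running a real classifier through the same harness and comparing against this baseline shows whether 65% is actually better than guessing the most common sentiment.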

1 Answer


Try using CrossFoldLearner, though I am doubtful whether it takes naïve Bayes as the learning model; I used OnlineLogisticRegression some time ago. Alternatively, you could write your own CrossFoldLearner with naïve Bayes as the learner. That said, I don't think changing the algorithm would improve the results drastically, which means you have to look carefully at the analyzer used for tokenization. Consider bigram tokenization instead of only unigram tokens. Have you given thought to phonetics, since most Twitter words are not dictionary words?
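The two suggestions above, bigram features plus an online logistic regression, can be sketched together in plain Python. This is a conceptual stand-in for Mahout's OnlineLogisticRegression, not its API, and the tiny training set is invented for illustration:

```python
import math
import random

def featurize(text):
    """Unigram tokens plus adjacent-word bigrams, e.g. 'not_good'."""
    toks = text.lower().split()
    return toks + [f"{a}_{b}" for a, b in zip(toks, toks[1:])]

def train_online_lr(samples, epochs=20, lr=0.5, seed=0):
    """Online (per-example SGD) logistic regression over sparse text features."""
    rng = random.Random(seed)
    samples = list(samples)
    w = {}  # sparse weight vector: feature -> weight
    for _ in range(epochs):
        rng.shuffle(samples)
        for text, label in samples:  # label: 1 = positive, 0 = negative
            feats = featurize(text)
            z = sum(w.get(f, 0.0) for f in feats)
            p = 1.0 / (1.0 + math.exp(-z))   # predicted P(positive)
            step = lr * (label - p)          # gradient of the log-loss
            for f in feats:
                w[f] = w.get(f, 0.0) + step
    return w

def predict(w, text):
    z = sum(w.get(f, 0.0) for f in featurize(text))
    return 1 if z > 0 else 0

data = [("great network coverage", 1), ("love the service", 1),
        ("terrible network today", 0), ("worst service ever", 0)]
model = train_online_lr(data)
```

The bigram features are what let a linear model pick up short negation patterns ("not_good") that unigram naïve Bayes treats as two independent, roughly neutral words.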