I'm trying to create a sentiment analysis tool that analyses tweets about Manchester United football club over a three-day period and determines whether people view the club positively or negatively. I'm working in Java and following this guide:
http://cavajohn.blogspot.co.uk/2013/05/how-to-sentiment-analysis-of-tweets.html
I am using Apache Flume to download my tweets into Apache Hadoop, and I intend to use Apache Hive to query the tweets. I may also use Apache Oozie to schedule the jobs that partition the tweets.
The link I posted above mentions that I need a training dataset to train the classifier that will analyse the tweets. The sample classifier provided uses some 5000 tweets. As I am doing this as a summer project for university, I feel I should probably create my own dataset.
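To make concrete what "training a classifier" on hand-labelled tweets involves, here is a minimal sketch of a multinomial Naive Bayes classifier in Java. This is my own illustration, not the classifier from the guide: the class name, the toy training tweets, and the whitespace tokenizer are all placeholders, and a real dataset would need far more labelled examples than the four shown.

```java
import java.util.*;

// Sketch of a multinomial Naive Bayes sentiment classifier.
// Training data and tokenization here are illustrative placeholders only.
public class TweetSentiment {
    private final Map<String, Map<String, Integer>> wordCounts = new HashMap<>();
    private final Map<String, Integer> classCounts = new HashMap<>();
    private final Set<String> vocab = new HashSet<>();

    // Record word frequencies for one hand-labelled tweet.
    public void train(String label, String tweet) {
        classCounts.merge(label, 1, Integer::sum);
        Map<String, Integer> counts =
                wordCounts.computeIfAbsent(label, k -> new HashMap<>());
        for (String w : tokenize(tweet)) {
            counts.merge(w, 1, Integer::sum);
            vocab.add(w);
        }
    }

    // Pick the label with the highest log-probability for the tweet.
    public String classify(String tweet) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        int totalDocs = classCounts.values().stream().mapToInt(Integer::intValue).sum();
        for (String label : classCounts.keySet()) {
            double score = Math.log((double) classCounts.get(label) / totalDocs);
            Map<String, Integer> counts = wordCounts.get(label);
            int totalWords = counts.values().stream().mapToInt(Integer::intValue).sum();
            for (String w : tokenize(tweet)) {
                // Laplace smoothing so unseen words don't zero out the probability.
                score += Math.log((counts.getOrDefault(w, 0) + 1.0)
                        / (totalWords + vocab.size()));
            }
            if (score > bestScore) {
                bestScore = score;
                best = label;
            }
        }
        return best;
    }

    private static List<String> tokenize(String text) {
        return Arrays.asList(text.toLowerCase().split("\\W+"));
    }

    public static void main(String[] args) {
        TweetSentiment clf = new TweetSentiment();
        // Toy hand-labelled examples; a usable training set needs far more.
        clf.train("positive", "great win for united brilliant performance");
        clf.train("positive", "love this team fantastic goal");
        clf.train("negative", "terrible defending awful result");
        clf.train("negative", "worst performance this season dreadful");
        System.out.println(clf.classify("what a brilliant goal"));
        System.out.println(clf.classify("awful dreadful defending"));
    }
}
```

The point of the sketch is that every manually labelled tweet just adds counts to these frequency tables, so the classifier's accuracy grows with the amount (and balance) of labelled data it sees.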
What is the minimum number of tweets I should use to make this classifier effective? Is there a recommended number? For example, if I manually labelled a hundred tweets, or five hundred, or a thousand, would that be enough?