1
votes

Need guidance on sentiment analysis of music-related tweets on Spark

I am trying to perform sentiment analysis on Twitter data for tweets related to music. After a lot of searching, I understand how to fetch tweets using the 'tweepy' Python API, and I realize I can use a Naive Bayes classifier to classify them. What confuses me is how to define features for this classification; I am supposed to define at least 500 features. I do not want to use an existing API such as 'textblob' to find the sentiment of a tweet. Here are my questions.

1) Can anyone give some examples of features we can use for classifying music-related tweets? [Can we use tweets with a happy smiley as a positive training set? If so, are the words in those tweets features for my classifier?]

2) How do we generate the training set for this classifier?

3) If I want to filter for music-related tweets, can I use a Bloom filter to do it?

4) How much data can I get through the tweepy API?

Please correct me if there is something wrong with my understanding.


1 Answer

2
votes

Since sentiment analysis is a supervised task, you need a training (and test) set. The training set needs labels (for sentiment analysis: positive, negative), usually assigned by humans (often called specialists). There is no magic number of instances for the training set (I have worked with about 1,500 records). If you need scientific evidence, analyze the mean squared error (MSE) of the model as a function of the training set size.
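As a rough illustration of checking the error against the training set size, here is a minimal sketch using only the Python standard library. The toy tweets, the labels, and the helper names (`train_nb`, `predict_nb`, `learning_curve`) are all made up for this example, and the classifier is a bare-bones multinomial Naive Bayes with add-one smoothing, not a tuned model. With labels encoded as 0/1, the misclassification rate computed here is the same number as the MSE of the hard predictions.

```python
import math
from collections import Counter

def train_nb(examples):
    """Fit a tiny multinomial Naive Bayes model on (text, label) pairs."""
    label_counts = Counter(label for _, label in examples)
    word_counts = {label: Counter() for label in label_counts}
    for text, label in examples:
        word_counts[label].update(text.lower().split())
    vocab = {w for counts in word_counts.values() for w in counts}
    return label_counts, word_counts, vocab

def predict_nb(model, text):
    """Return the label with the highest log-posterior (add-one smoothing)."""
    label_counts, word_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(label_counts):
        counts = word_counts[label]
        denom = sum(counts.values()) + len(vocab)
        score = math.log(label_counts[label] / total)
        for word in text.lower().split():
            if word in vocab:  # words never seen in training are ignored
                score += math.log((counts[word] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

def learning_curve(train, test, sizes):
    """Train on the first n examples for each n; return (n, test error rate)."""
    curve = []
    for n in sizes:
        model = train_nb(train[:n])
        errors = sum(predict_nb(model, text) != label for text, label in test)
        curve.append((n, errors / len(test)))
    return curve

# Hypothetical hand-labeled tweets, only for illustration.
TRAIN = [
    ("love this new album", "pos"),
    ("this song is terrible", "neg"),
    ("great concert last night", "pos"),
    ("worst remix ever", "neg"),
    ("amazing voice love it", "pos"),
    ("boring track hate it", "neg"),
]
TEST = [("love this song", "pos"), ("hate this remix", "neg")]

curve = learning_curve(TRAIN, TEST, sizes=[2, 4, 6])
for n, err in curve:
    print(f"train size={n}  test error={err:.2f}")
```

On real data you would plot these pairs and stop collecting labels once the curve flattens out.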

1) The most common approach is TF-IDF. It ranks the best features (including smileys and other symbols). You just need to set the number of features. Again, there is no best number; you should run tests to tune your model.

2) You need a training set with a label (positive or negative) for each tweet. Generally, the labels are provided by a human annotator.

3) I've never used a Bloom filter.

4) Generally, the Twitter API gives only about 1-2% of all tweets. I guess Tweepy cannot give you more than that.
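To make point 1 more concrete, here is a minimal sketch of picking the top-N terms by TF-IDF, again with only the Python standard library. The function name `top_tfidf_features`, the whitespace tokenizer, and the choice to rank terms by their TF-IDF summed over all tweets are assumptions for illustration; a library such as scikit-learn or Spark MLlib would normally do this for you, and may weight terms slightly differently. Splitting on whitespace keeps smileys such as ":)" as tokens, so they can be selected as features too.

```python
import math
from collections import Counter

def tokenize(tweet):
    """Lowercase and split on whitespace, so ':)' survives as a token."""
    return tweet.lower().split()

def top_tfidf_features(tweets, num_features):
    """Return the num_features terms with the highest summed TF-IDF score."""
    docs = [tokenize(t) for t in tweets]
    n_docs = len(docs)
    # Document frequency: in how many tweets each term appears.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    scores = Counter()
    for doc in docs:
        term_freq = Counter(doc)
        for term, count in term_freq.items():
            idf = math.log(n_docs / doc_freq[term])  # 0 for terms in every tweet
            scores[term] += (count / len(doc)) * idf
    # Highest score first; ties broken alphabetically for a stable result.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:num_features]]

# Hypothetical tweets, only for illustration.
TWEETS = [
    "love this song :)",
    "this song is boring :(",
    "love the new album :)",
    "this concert was great",
]

features = top_tfidf_features(TWEETS, num_features=5)
print(features)
```

For the 500 features in the question, you would call it with `num_features=500` on the full tweet collection and then try a few other values to see which works best.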

I hope this helps.