1
votes

I am trying to implement a Naive Bayes algorithm to read tweets from a CSV file and classify them into categories I define (for example: tech, science, politics).

I want to use NLTK's Naive Bayes classifier, but the example is not anywhere close to what I need to do.

One of my biggest points of confusion is: how do we improve the classification accuracy of NB?

**I am hoping to get some guidance on the detailed steps I need to take to do the classification.

  • Do I have to create separate CSV files for each category and manually put the tweets in them?
  • If I do the above, how do I train the algorithm, and how does the algorithm test?**

I have been researching online and found some brief examples like TextBlob, which makes use of NLTK's NB algorithm to do sentiment classification of tweets. It is simple to understand but difficult for beginners to tweak.

http://stevenloria.com/how-to-build-a-text-classification-system-with-python-and-textblob/

In his example from the link above, how does he implement the test when he has already put the sentiment next to the tweets? I thought that to test, we should hide the second argument.

train = [
    ('I love this sandwich.', 'pos'),
    ('This is an amazing place!', 'pos'),
    ('I feel very good about these beers.', 'pos'),
    ('This is my best work.', 'pos'),
    ("What an awesome view", 'pos'),
    ('I do not like this restaurant', 'neg'),
    ('I am tired of this stuff.', 'neg'),
    ("I can't deal with this", 'neg'),
    ('He is my sworn enemy!', 'neg'),
    ('My boss is horrible.', 'neg')
]
test = [
    ('The beer was good.', 'pos'),
    ('I do not enjoy my job', 'neg'),
    ("I ain't feeling dandy today.", 'neg'),
    ("I feel amazing!", 'pos'),
    ('Gary is a friend of mine.', 'pos'),
    ("I can't believe I'm doing this.", 'neg')
]
1
I can answer your final question with another: How would you measure the success of your test if you did not have the correct sentiment available? The testing routine separates the answer from the text, runs the text through the classifier, and compares the result with the answer. You can look through the NLTK source code to see it. - alexis

1 Answer

4
votes

You first have to understand how Bayes' theorem works:

P(A | B) = P(B | A) · P(A) / P(B)

In other words, you have to find P(B|A), P(A) and P(B). In your case, P(A|B) = P(positive | sentence). That is:

  • P(B) = probability of seeing these exact words in a sentence
  • P(A) = probability that a sentence is positive
  • P(B | A) = given positive sentiment, the probability of finding the words in B
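
As a worked example of the three quantities above, here is a minimal sketch that plugs them into Bayes' theorem for a single word. The counts are hypothetical, invented purely for illustration, and `posterior` is just a helper name chosen for this sketch.

```python
def posterior(p_b_given_a, p_a, p_b):
    """Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)."""
    return p_b_given_a * p_a / p_b

# Hypothetical counts, invented for illustration only.
total_sentences = 10
positive_sentences = 5
sentences_with_good = 3            # sentences containing the word "good"
positive_sentences_with_good = 2   # of those, how many are positive

p_a = positive_sentences / total_sentences                        # P(A) = P(positive)
p_b = sentences_with_good / total_sentences                       # P(B) = P("good")
p_b_given_a = positive_sentences_with_good / positive_sentences   # P(B | A)

p_positive_given_good = posterior(p_b_given_a, p_a, p_b)
print(round(p_positive_given_good, 3))  # 0.667
```

The result matches intuition: two of the three sentences containing "good" are positive, so P(positive | "good") is 2/3.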

What you have to do is this:

  • split the sentences into words
  • remove "fillers" like "the", "and", "is", "was" etc.
  • for each sentence create a list of attributes like "good", "bad", "amazing" etc. These become the features of your Bayesian classifier.
  • For each feature, find how often it occurs in "positive" sentences (and likewise for every other label). These frequencies are your estimates of P(B | A).
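
The first three steps above can be sketched roughly as follows. The stop-word list here is a tiny hypothetical one (a real one would be much longer), and `extract_features` is a name made up for this sketch. The output is a dictionary of feature names, which is the general shape of feature set that NLTK's classifiers accept.

```python
import re

# Hypothetical, deliberately tiny list of "filler" words.
STOPWORDS = {"the", "and", "is", "was", "a", "an", "this", "i", "of", "to"}

def extract_features(sentence):
    """Split a sentence into lowercase words and drop the fillers."""
    words = re.findall(r"[a-z']+", sentence.lower())
    # Each remaining word becomes a boolean "this word is present" feature.
    return {word: True for word in words if word not in STOPWORDS}

features = extract_features("The beer was good.")
print(features)  # {'beer': True, 'good': True}
```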

Next, given a test sentence:

  1. Split it into features like you did with the training sentences.
  2. Look up the scores these words (B) received during training.
  3. Calculate the probability of these indicating a "positive" or "negative" sentiment (=P(A|B)).
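
To make the training and test steps concrete, here is a small from-scratch sketch, not NLTK's actual implementation. Two details are assumptions added so the sketch behaves sensibly: add-one (Laplace) smoothing, so that a word never seen with a label does not zero out the whole product, and log probabilities, to avoid multiplying many small numbers. P(B) is dropped because it is the same for every label and does not change which label wins. All function names are hypothetical.

```python
import math
from collections import Counter, defaultdict

def tokenize(sentence):
    """Lowercase, split on whitespace, strip surrounding punctuation."""
    words = (w.strip(".,!?").lower() for w in sentence.split())
    return [w for w in words if w]

def train_nb(labeled_sentences):
    """Count how often each label and each (label, word) pair occurs."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for sentence, label in labeled_sentences:
        label_counts[label] += 1
        for word in tokenize(sentence):
            word_counts[label][word] += 1
            vocab.add(word)
    return label_counts, word_counts, vocab

def classify(sentence, model):
    """Return the label with the highest log P(A) + sum of log P(word | A)."""
    label_counts, word_counts, vocab = model
    total = sum(label_counts.values())
    scores = {}
    for label, count in label_counts.items():
        score = math.log(count / total)  # log P(A)
        # Add-one smoothing: every vocabulary word gets one extra count.
        denom = sum(word_counts[label].values()) + len(vocab)
        for word in tokenize(sentence):
            if word in vocab:  # ignore words never seen in training
                score += math.log((word_counts[label][word] + 1) / denom)
        scores[label] = score
    return max(scores, key=scores.get)

train = [
    ('I love this sandwich.', 'pos'),
    ('This is an amazing place!', 'pos'),
    ('I do not like this restaurant', 'neg'),
    ('My boss is horrible.', 'neg'),
]
model = train_nb(train)
print(classify("What an amazing sandwich", model))  # pos
print(classify("A horrible boss", model))           # neg
```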

There's a bit of hand-waving in these arguments; find more specific instructions here (you already mentioned the second link in your question):

To answer your specific question:

In his example from the link above, how does he implement the test when he has already put the sentiment next to the tweets? I thought that to test, we should hide the second argument.

In order to test, you need to know what the correct result is. Otherwise you have no way of telling how well the algorithm performs, as it will always give you "some" answer. That is why you have to include the label (second argument) in your tests.
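
A rough sketch of what such a test routine does: it hides the label from the classifier, asks for a prediction on the text alone, and only then compares the prediction with the label. The `toy_classifier` below is a hypothetical keyword-based stand-in, not a trained model. As far as I know, `nltk.classify.accuracy` follows the same idea for NLTK classifiers.

```python
def accuracy(classify_fn, labeled_test):
    """Fraction of test items whose predicted label matches the true label."""
    correct = 0
    for text, true_label in labeled_test:
        predicted = classify_fn(text)  # the classifier only ever sees the text
        if predicted == true_label:
            correct += 1
    return correct / len(labeled_test)

def toy_classifier(text):
    """Hypothetical stand-in: 'pos' if any hand-picked keyword appears."""
    positive_keywords = ("good", "amazing", "love")
    lowered = text.lower()
    return "pos" if any(k in lowered for k in positive_keywords) else "neg"

test = [
    ('The beer was good.', 'pos'),
    ('I do not enjoy my job', 'neg'),
    ("I ain't feeling dandy today.", 'neg'),
    ("I feel amazing!", 'pos'),
    ('Gary is a friend of mine.', 'pos'),
    ("I can't believe I'm doing this.", 'neg'),
]
score = accuracy(toy_classifier, test)
print(round(score, 3))  # 0.833
```

Here the stand-in gets five of the six test sentences right. It misses 'Gary is a friend of mine.', and that miss is only detectable because the correct label was kept alongside the text.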