I am trying to implement a Naive Bayes classifier that reads tweets from a CSV file and classifies them into categories I define (for example: tech, science, politics).
I want to use NLTK's Naive Bayes classifier, but the examples I have found are not close to what I need to do.
One of my biggest points of confusion is how to improve the classification accuracy of Naive Bayes.
**I am hoping to get some guidance on the detailed steps I need to take to do the classification:**
- Do I have to create a separate CSV file for each category and manually put the tweets in it?
- If so, how do I train the algorithm, and how does the algorithm test?
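To make the data-layout question concrete, here is a minimal sketch of one possible alternative to separate files: a single CSV with one column for the tweet and one for its hand-assigned category. The column names `text` and `category`, the sample rows, and the helper names `load_labeled_tweets` and `split_train_test` are all assumptions made up for illustration, not something NLTK or TextBlob requires.

```python
import csv
import io
import random

# Hypothetical CSV contents. With a real file this would be something like
# open("tweets.csv", newline="", encoding="utf-8") instead of StringIO.
csv_data = io.StringIO(
    "text,category\n"
    "New phone chips are getting faster,tech\n"
    "The laptop update broke my wifi,tech\n"
    "Telescope spots a distant galaxy,science\n"
    "Researchers publish a new vaccine study,science\n"
    "Physicists measure the particle again,science\n"
    "Senate votes on the budget tonight,politics\n"
    "The election debate was heated,politics\n"
    "Parliament passes the new law,politics\n"
)

def load_labeled_tweets(handle):
    """Read (text, category) pairs from a CSV with 'text' and 'category' columns."""
    reader = csv.DictReader(handle)
    return [(row["text"], row["category"]) for row in reader]

def split_train_test(rows, test_fraction=0.25, seed=0):
    """Shuffle the labeled rows and hold out a fraction of them for testing."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_test = max(1, int(len(rows) * test_fraction))
    return rows[n_test:], rows[:n_test]

labeled = load_labeled_tweets(csv_data)
train_rows, test_rows = split_train_test(labeled)
```

Both resulting lists have the same `(text, label)` shape as the `train` and `test` lists in the TextBlob example, so the manual labeling happens once in the CSV rather than by sorting tweets into separate files.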
I have been researching online and found some brief examples, such as TextBlob, which makes use of NLTK's NB algorithm to do sentiment classification of tweets. It is simple to understand but difficult for beginners to tweak.
http://stevenloria.com/how-to-build-a-text-classification-system-with-python-and-textblob/
In the example from the link above, how does he run the test when the sentiment is already placed next to each tweet? I thought that to test, we should hide the second element of each tuple.
train = [
    ('I love this sandwich.', 'pos'),
    ('This is an amazing place!', 'pos'),
    ('I feel very good about these beers.', 'pos'),
    ('This is my best work.', 'pos'),
    ("What an awesome view", 'pos'),
    ('I do not like this restaurant', 'neg'),
    ('I am tired of this stuff.', 'neg'),
    ("I can't deal with this", 'neg'),
    ('He is my sworn enemy!', 'neg'),
    ('My boss is horrible.', 'neg')
]

test = [
    ('The beer was good.', 'pos'),
    ('I do not enjoy my job', 'neg'),
    ("I ain't feeling dandy today.", 'neg'),
    ("I feel amazing!", 'pos'),
    ('Gary is a friend of mine.', 'pos'),
    ("I can't believe I'm doing this.", 'neg')
]
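As a sketch of how the labels in a test set are typically used, assuming the usual accuracy-style evaluation: only the text is passed to the classifier, and the stored label is compared with the prediction afterwards. The `toy_classify` function below is a made-up stand-in for a trained classifier (it is not Naive Bayes and not part of NLTK or TextBlob), and `accuracy` is a hypothetical helper written for this illustration.

```python
def toy_classify(text):
    """Hypothetical stand-in for a trained classifier: a crude keyword rule."""
    negative_cues = {"not", "can't", "ain't"}
    words = text.lower().split()
    return "neg" if any(word in negative_cues for word in words) else "pos"

def accuracy(classify, labeled_examples):
    """Fraction of examples whose predicted label matches the stored label."""
    correct = 0
    for text, true_label in labeled_examples:
        predicted = classify(text)      # the classifier only ever sees the text
        if predicted == true_label:     # the label is used only for scoring
            correct += 1
    return correct / len(labeled_examples)

test = [
    ('The beer was good.', 'pos'),
    ('I do not enjoy my job', 'neg'),
    ("I ain't feeling dandy today.", 'neg'),
    ("I feel amazing!", 'pos'),
    ('Gary is a friend of mine.', 'pos'),
    ("I can't believe I'm doing this.", 'neg'),
]

score = accuracy(toy_classify, test)
print(f"accuracy: {score:.2f}")
```

In this sketch the second element of each tuple is effectively hidden during prediction; it sits next to the tweet only so the prediction can be checked against it.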
