5
votes

I'm currently trying to classify tweets using the Naive Bayes classifier in NLTK. I'm classifying tweets related to particular stock symbols, using the '$' prefix (e.g. $AAPL). I've been basing my Python script off of this blog post: Twitter Sentiment Analysis using Python and NLTK. So far, I've been getting reasonably good results, but I feel there is much, much room for improvement.

In my word-feature selection method, I decided to implement the tf-idf algorithm to select the most informative words. After doing this, though, I felt that the results weren't that impressive.
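For context, my tf-idf step looks roughly like the sketch below (a simplified version with made-up placeholder tweets, not my exact code):

    import math
    from collections import Counter

    # Made-up example tweets; in practice these are the collected $AAPL tweets, tokenized.
    tweets = [
        "$AAPL is climbing today after great earnings".split(),
        "selling all my $AAPL shares after a bad quarter".split(),
        "$AAPL is flat and the market is waiting".split(),
    ]

    # Document frequency: in how many tweets does each word appear?
    df = Counter(word for tweet in tweets for word in set(tweet))
    num_docs = len(tweets)

    def tfidf(word, tweet):
        # Raw term count times a smoothed inverse document frequency.
        return tweet.count(word) * math.log(num_docs / (1 + df[word]))

    # Score every word by its best tf-idf value across the corpus and keep the top ones.
    scores = {}
    for tweet in tweets:
        for word in set(tweet):
            scores[word] = max(scores.get(word, 0), tfidf(word, tweet))

    most_informative = sorted(scores, key=scores.get, reverse=True)[:100]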

I then implemented the technique from the following blog post: Text Classification Sentiment Analysis Eliminate Low Information Features. The results were very similar to the ones obtained with the tf-idf algorithm, which led me to inspect my classifier's 'Most Informative Features' list more thoroughly. That's when I realized I had a bigger problem:
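That kind of low-information elimination can be done by scoring each word with NLTK's chi-square measure and keeping only the top scorers. A simplified sketch of that idea, reduced to two classes and using placeholder tokenized tweets:

    from nltk import FreqDist, ConditionalFreqDist
    from nltk.metrics import BigramAssocMeasures

    # Placeholder data; in practice these come from the labelled tweet corpus.
    pos_tweets = [["$AAPL", "beats", "earnings", "expectations"]]
    neg_tweets = [["$AAPL", "misses", "on", "revenue"]]

    word_fd = FreqDist()                    # overall word counts
    label_word_fd = ConditionalFreqDist()   # word counts per class

    for tweet in pos_tweets:
        for word in tweet:
            word_fd[word.lower()] += 1
            label_word_fd['pos'][word.lower()] += 1
    for tweet in neg_tweets:
        for word in tweet:
            word_fd[word.lower()] += 1
            label_word_fd['neg'][word.lower()] += 1

    pos_count = label_word_fd['pos'].N()
    neg_count = label_word_fd['neg'].N()
    total_count = pos_count + neg_count

    # Chi-square association between each word and each class.
    word_scores = {}
    for word, freq in word_fd.items():
        pos_score = BigramAssocMeasures.chi_sq(
            label_word_fd['pos'][word], (freq, pos_count), total_count)
        neg_score = BigramAssocMeasures.chi_sq(
            label_word_fd['neg'][word], (freq, neg_count), total_count)
        word_scores[word] = pos_score + neg_score

    # Keep only the highest-scoring words as candidate features.
    best_words = set(sorted(word_scores, key=word_scores.get, reverse=True)[:1000])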

Tweets and real language don't use the same grammar and wording. In normal text, many articles and verbs can be singled out using tf-idf or a stopword list. In a tweet corpus, however, some extremely uninformative words, such as 'the', 'and', 'is', etc., occur just as often as words that are crucial to categorizing the text correctly. I can't just remove all words with fewer than 3 letters, because some uninformative features are longer than that, and some informative ones are shorter.

If I could, I would rather not use stopwords at all, because of the need to frequently update the list. However, if that's my only option, I guess I'll have to go with it.

So, to summarize my question: does anyone know how to get the truly most informative words from a source as particular as tweets?

EDIT: I'm trying to classify into three groups: positive, negative, and neutral. Also, I was wondering, for TF-IDF, should I only be cutting off the words with the low scores, or also some with the higher scores? In each case, what percentage of the vocabulary of the text source would you exclude from the feature selection process?

1
How big is your corpus of tweets? What kind of scores are you getting right now? Also, have you considered using a different classifier than Naive Bayes and/or using other features than just words (e.g. author)? – Fred Foo
I have not considered using other features: the authors would be too diverse. My corpus, for the moment, is only a couple of hundred tweets. As for the scores, depending on the size of my test corpus (always getting bigger), they range from 0 to 0.3, I'd say. – elliottbolzan
By score, I meant accuracy/F1/whatever you're measuring. And you might get better results if you have a larger corpus: e.g. idf may become much more accurate. – Fred Foo
Well, when I calculated the accuracy, it gave me a value between 0 and 1. And I understand your point about the larger corpus, but it's just strange that 'the' has a high tf-idf score, whatever the source. – elliottbolzan
Do you mean your accuracy is between 0 and 0.3? That's pretty poor. What's the number of classes? – Fred Foo

1 Answer

2
votes

The blog post you link to describes the show_most_informative_features method, but NaiveBayesClassifier also has a most_informative_features method that returns the features rather than just printing them. You could simply set a cutoff based on your training set: features like "the" and "and", along with other uninformative words, would be at the bottom of the list in terms of informativeness.
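For instance (a rough sketch with placeholder training data; in practice you would reuse the feature dictionaries you already build from your tweets):

    from nltk.classify import NaiveBayesClassifier

    # Placeholder training data; substitute your real (features, label) pairs.
    train_set = [
        ({'bullish': True, 'the': True}, 'positive'),
        ({'bearish': True, 'the': True}, 'negative'),
        ({'flat': True, 'is': True}, 'neutral'),
    ]

    classifier = NaiveBayesClassifier.train(train_set)

    # most_informative_features returns (feature_name, feature_value) pairs,
    # sorted from most to least informative, so you can keep only the top N
    # instead of just printing them with show_most_informative_features.
    cutoff = 1000  # arbitrary; tune against held-out data
    best = set(name for name, value in classifier.most_informative_features(n=cutoff))

    def extract_features(tweet_words):
        # Only emit features that survived the informativeness cutoff.
        return {word: True for word in tweet_words if word in best}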

It's true that this approach could be subject to overfitting (some features would be much more important in your training set than in your test set), but that would be true of anything that filters features based on your training set.