I'm training an SVM classifier on a text dataset, using scikit-learn. The documentation explains well how to use a CountVectorizer to build feature vectors from n-grams. For example, for unigrams and bigrams, I can do something like:
CountVectorizer(ngram_range=(1, 2))
However, I'm not sure how to build emoticons into the feature vector. There seem to be two options: either use a regex that matches emoticons and pass it as the
token_pattern
argument to CountVectorizer, or construct a custom vocabulary that includes the emoticons and pass it as the
vocabulary
argument. Any advice, or in particular a simple example, would be great! Also, let me know if I've left any crucial information out of the question.
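To make the two options concrete, here is a minimal sketch of each. The regex, the three-emoticon vocabulary and the sample documents are illustrative assumptions, not a vetted emoticon list. Note that lowercase=False is set in both cases, since the vectorizer lowercases text before tokenizing and ':D' would otherwise become ':d'.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up sample documents, for illustration only.
docs = ["great day :) :)", "so sad :-(", "lol :D"]

# Option 1: a token_pattern regex that matches only (a few) emoticons.
# The pattern is a rough guess: eyes, optional nose, mouth.
regex_vect = CountVectorizer(token_pattern=r"[:;=]-?[()DP]", lowercase=False)
X_regex = regex_vect.fit_transform(docs)
print(sorted(regex_vect.vocabulary_))
print(X_regex.toarray())

# Option 2: a fixed vocabulary of emoticons. The tokenizer still has to
# produce the emoticons as tokens, so split on whitespace instead of
# relying on the default word-based pattern.
vocab_vect = CountVectorizer(
    vocabulary=[":)", ":-(", ":D"],
    token_pattern=r"\S+",
    lowercase=False,
)
X_vocab = vocab_vect.transform(docs)  # no fit needed with a fixed vocabulary
print(X_vocab.toarray())
```

In this sketch, option 1 picks up emoticons even when they are glued to a word, while option 2 only counts emoticons that stand alone between whitespace.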
Edit: My Solution
After some experimentation with the above problem, this was the code that worked for me. It assumes that you have split up your data into arrays, such as:
training_data, training_labels, test_data, test_labels
We use CountVectorizer, so first import that:
from sklearn.feature_extraction.text import CountVectorizer
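# The default token_pattern only keeps runs of two or more word characters,
# so emoticons such as ':)' would be dropped; r'\S+' splits on whitespace instead.
c_vect = CountVectorizer(token_pattern=r'\S+')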
Then build a list of emoticons (I got mine from a text dump online):
emoticon_list = [':)', ':-)', ':(']  # ... etc. - extend with your long list of emoticons
Next, fit the CountVectorizer with the array of emoticons. It's crucial to use fit, not fit_transform:
c_vect.fit(emoticon_list)  # fit returns the vectorizer itself, so there is no matrix to keep
Then construct a feature vector by counting the number of emoticons in the training data (in my case, an array of tweets) using the transform method:
emoticon_training_features = c_vect.transform(training_data)
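As a small illustration of why the vectorizer is fitted on the emoticon list rather than on the tweets, the sketch below compares the two. The tweets are made up, and the whitespace token_pattern is an assumption chosen so that emoticons survive tokenization.

```python
from sklearn.feature_extraction.text import CountVectorizer

emoticon_list = [":)", ":-)", ":("]
# Made-up tweets, for illustration only.
tweets = ["love this :)", "worst day ever :(", "no emoticons here"]

# Fitting on the emoticon list limits the vocabulary to emoticons.
vect = CountVectorizer(token_pattern=r"\S+")
vect.fit(emoticon_list)
features = vect.transform(tweets)
print(sorted(vect.vocabulary_))  # feature columns follow this sorted order
print(features.toarray())        # one row per tweet, one column per emoticon

# Fitting on the tweets instead would learn every token they contain.
all_tokens = CountVectorizer(token_pattern=r"\S+").fit(tweets)
print(len(all_tokens.vocabulary_))
```

Each row of the resulting matrix has one column per emoticon, and a tweet without emoticons becomes an all-zero row.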
Now we can train our classifier, clf, using the labels and our new emoticon feature vector (remembering that for certain classifiers such as SVC, you will need to first convert your string labels to appropriate numbers):
clf.fit(emoticon_training_features, training_labels)
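If your labels are strings and need converting to numbers, one possible way is scikit-learn's LabelEncoder. This is a minimal sketch with made-up labels; the variable names are only for illustration.

```python
from sklearn.preprocessing import LabelEncoder

# Made-up string labels, for illustration only.
string_labels = ["positive", "negative", "neutral", "positive"]

encoder = LabelEncoder()
numeric_labels = encoder.fit_transform(string_labels)

print(list(encoder.classes_))  # classes are stored in sorted order
print(numeric_labels)          # each label replaced by its class index

# Map numeric predictions back to the original strings later on.
decoded = encoder.inverse_transform([2, 0])
print(list(decoded))
```

Keeping the fitted encoder around lets you apply the same mapping to the test labels and turn predictions back into readable class names.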
Then, to evaluate the classifier, we must transform our test data with the same fitted vectorizer, so that it is described by the same emoticon features:
emoticon_test_features = c_vect.transform(test_data)
Finally, we can perform our prediction:
predicted = clf.predict(emoticon_test_features)
Done. A fairly standard way to evaluate performance at this point is to use:
from sklearn.metrics import classification_report
print(classification_report(test_labels, predicted))
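Putting the steps together, a toy end-to-end run might look like the sketch below. The data, the two-emoticon list, the numeric labels, the whitespace token_pattern and the linear kernel are all assumptions made for illustration; swap in your own arrays and classifier.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import classification_report
from sklearn.svm import SVC

# Toy data, made up purely for illustration (1 = positive, 0 = negative).
training_data = ["what a day :)", "missed the bus :(", "so happy :)", "feeling ill :("]
training_labels = [1, 0, 1, 0]
test_data = ["great news :)", "bad news :("]
test_labels = [1, 0]

# Assumption: split on whitespace so emoticons survive tokenization.
c_vect = CountVectorizer(token_pattern=r"\S+")
c_vect.fit([":)", ":("])

# Train on emoticon counts only.
clf = SVC(kernel="linear")
clf.fit(c_vect.transform(training_data), training_labels)

# Reuse the same fitted vectorizer for the test data.
predicted = clf.predict(c_vect.transform(test_data))
print(classification_report(test_labels, predicted))
```

On data this small the report only confirms that the pieces fit together; it says nothing about real-world accuracy.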
Phew. Hope that helps.