3 votes

I'm training an SVM classifier on a text dataset using scikit-learn. The documentation is good on using a count vectorizer to construct a feature vector from n-grams. E.g., for unigrams and bigrams, I can do something like:

    CountVectorizer(ngram_range=(1, 2))
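
As a quick, self-contained sanity check of the n-gram setup (toy documents, purely illustrative):

    from sklearn.feature_extraction.text import CountVectorizer

    docs = ["the cat sat", "the cat sat on the mat"]
    vect = CountVectorizer(ngram_range=(1, 2))
    X = vect.fit_transform(docs)
    # get_feature_names_out() on scikit-learn >= 1.0; older versions use get_feature_names()
    print(vect.get_feature_names_out())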

However, I wasn't sure how you would go about building emoticons into the feature vector. There seem to be two available options: either use a regex that matches the emoticons and feed it into the `token_pattern` argument of the CountVectorizer, or construct a custom vocabulary that includes the emoticons and feed that into the `vocabulary` argument. Any advice, or in particular a simple example, would be great! Also, let me know if there's any other crucial info that I've left out of the question.

Edit: My Solution

After some experimentation with the above problem, this was the code that worked for me. It assumes that you have split up your data into arrays, such as:

training_data, training_labels, test_data, test_labels
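
If you haven't made that split yet, scikit-learn's train_test_split is one way to do it (the toy texts and labels below are placeholders):

    from sklearn.model_selection import train_test_split

    texts = ["great day :)", "awful commute :(", "so happy :-)", "meh"]
    labels = ["pos", "neg", "pos", "neg"]
    training_data, test_data, training_labels, test_labels = train_test_split(
        texts, labels, test_size=0.25, random_state=42)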

We use CountVectorizer, so first import that:

from sklearn.feature_extraction.text import CountVectorizer
# The default token_pattern only matches word characters and would drop
# emoticons like ':)' entirely, so use a permissive pattern instead:
c_vect = CountVectorizer(token_pattern=r'\S+')

Then build a list of emoticons as an array. (I got my list from text dump online):

emoticon_list = [':)', ':-)', ':(']  # ... put your long list of emoticons here

Next, fit the CountVectorizer on the list of emoticons. It's crucial to use fit, not fit_transform: fit only learns the vocabulary (one feature per emoticon); the actual documents are transformed in the next step:

c_vect.fit(emoticon_list)

Then construct a feature vector by counting the number of emoticons in the training data (in my case, an array of tweets) using the transform method:

emoticon_training_features = c_vect.transform(training_data) 

Now we can train our classifier, clf, using the labels and our new emoticon features (remembering that certain classifiers, such as SVC in some versions, may require you to first convert your string labels to numbers; a sketch of this follows the fit call below):

clf.fit(emoticon_training_features, training_labels)
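
For completeness, a sketch of one concrete setup for clf together with the label conversion; SVC with a linear kernel and LabelEncoder are just illustrative choices:

    from sklearn.preprocessing import LabelEncoder
    from sklearn.svm import SVC

    le = LabelEncoder()
    training_labels_num = le.fit_transform(training_labels)  # e.g. ['neg', 'pos'] -> [0, 1]

    clf = SVC(kernel='linear')
    clf.fit(emoticon_training_features, training_labels_num)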

Then to evaluate the performance of the classifier, we must transform our test data to make use of the emoticon features available:

emoticon_test_features = c_vect.transform(test_data)

Finally, we can perform our prediction:

predicted = clf.predict(emoticon_test_features)

Done. A fairly standard way to evaluate performance at this point is to use:

from sklearn.metrics import classification_report
print(classification_report(test_labels, predicted))

Phew. Hope that helps.

1 Answer

2 votes

Both options should work.

There's a third option, which is to tokenize your samples manually and feed them to a DictVectorizer instead of a CountVectorizer. Example using the simplest tokenizer there is, str.split:

>>> from collections import Counter
>>> from sklearn.feature_extraction import DictVectorizer
>>> vect = DictVectorizer()
>>> samples = [":) :) :)", "I have to push the pram a lot"]
>>> X = vect.fit_transform(Counter(s.split()) for s in samples)
>>> X
<2x9 sparse matrix of type '<type 'numpy.float64'>'
    with 9 stored elements in Compressed Sparse Row format>
>>> vect.vocabulary_
{'a': 2, ':)': 0, 'I': 1, 'to': 8, 'have': 3, 'lot': 4, 'push': 6, 'the': 7, 'pram': 5}
>>> vect.inverse_transform(X[0])  # just for inspection
[{':)': 3.0}]

However, with DictVectorizer you'll have to build your own bigrams.
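
For instance, a rough sketch of hand-rolled unigram-plus-bigram counting to feed the same DictVectorizer (ngram_counts is a hypothetical helper, not part of scikit-learn):

>>> from collections import Counter
>>> def ngram_counts(text):
...     tokens = text.split()
...     counts = Counter(tokens)  # unigrams
...     counts.update(' '.join(p) for p in zip(tokens, tokens[1:]))  # bigrams
...     return counts
>>> ngram_counts(":) :) :)")
Counter({':)': 3, ':) :)': 2})

These dicts can then go through vect.fit_transform exactly as in the example above.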