
I'm trying to use the scikit-learn module for text classification. It's a dataset with lots of unique words, the nature of which will be apparent from the following example:

train_counts = count_vect.fit_transform(data)
train_counts.shape

print len(range(len(data)-1)) 

clf = MultinomialNB(alpha=1).fit(train_counts, range(len(data)) )

docs_new = ['Modern Warfare 2', 'Modern Warfare 3', 'Modern Warfare 1', 'Modern Warfare 4', 'Modern Warfare', 'Mahjong Kakutou Club', 'Mass Effect 2']

new_counts = count_vect.transform(docs_new)
predicted = clf.predict(new_counts)

for doc, category in zip(docs_new, predicted):
    print '%r => %s' % (doc, target_names[category])

and the output looks like this.

763
'Modern Warfare 2' => Call of Duty: Modern Warfare 3
'Modern Warfare 3' => Call of Duty: Modern Warfare 3
'Modern Warfare 1' => Call of Duty: Modern Warfare 3
'Modern Warfare 4' => Call of Duty: Modern Warfare 3
'Modern Warfare' => Call of Duty: Modern Warfare 3
'Mahjong Kakutou Club' => Mahjong Kakutou Club
'Mass Effect 2' => Mass Effect 2

This is a multinomial example, but I get identical results with a Bernoulli example. I have tried alpha values from 0 to 1000000. Can anyone explain to me why this is the case?

EDIT: I should have made it clear that the following classes exist: Call of Duty: Modern Warfare, Call of Duty: Modern Warfare 2, ... and most other games; the list of all PlayStation games was taken from Wikipedia.

Also, using the full title, e.g. Call of Duty: Modern Warfare 2, as a test string produces the same result.

I was originally using the NLTK classifier, but for some reason it didn't place much value on words like "Kakutou" that were not present in any other examples (the scikit-learn one obviously does). It didn't have the problem with the numbers that the scikit-learn classifier does, though.

Any guidance or information here would be immensely valuable.

Thanks

Edit: the dataset is from here: http://en.wikipedia.org/wiki/List_of_PlayStation_3_games. It's the first column; each example has a label and content that are the same.

What is your train data? Is it documents 2-3 words long? – zenpoy

It's a list of video games; they range from 1 to about 10 words long. – user779420

I think you should go over this tutorial. – zenpoy

Could you be more specific as to why? I already have... – user779420

clf = MultinomialNB(alpha=1).fit(train_counts, range(len(data))) means that you have one class per sample. I don't see how you would expect anything to be able to generalize by seeing only one example per class and having as many distinct classes as examples. – ogrisel

1 Answer


The code does not show how count_vect is constructed, but if it is just a default-initialized CountVectorizer, then it ignores single-character tokens (i.e. the series numbers), making all of the "Modern Warfare ..." titles tokenize the same as "Modern Warfare":

>>> from sklearn.feature_extraction.text import CountVectorizer as CV
>>> count_vect=CV()
>>> docs_new = ['Modern Warfare 2', 'Modern Warfare 3', 'Modern Warfare 1', 'Modern Warfare 4', 'Modern Warfare A', 'Modern Warfare 44', 'Modern Warfare AA', 'Modern Warfare', 'Mahjong Kakutou Club', 'Mass Effect 2']
>>> new_counts = count_vect.fit_transform(docs_new)
>>> count_vect.inverse_transform(new_counts)
[array([u'modern', u'warfare'], 
      dtype='<U7'), array([u'modern', u'warfare'], 
      dtype='<U7'), array([u'modern', u'warfare'], 
      dtype='<U7'), array([u'modern', u'warfare'], 
      dtype='<U7'), array([u'modern', u'warfare'], 
      dtype='<U7'), array([u'44', u'modern', u'warfare'], 
      dtype='<U7'), array([u'aa', u'modern', u'warfare'], 
      dtype='<U7'), array([u'modern', u'warfare'], 
      dtype='<U7'), array([u'club', u'kakutou', u'mahjong'], 
      dtype='<U7'), array([u'effect', u'mass'], 
      dtype='<U7')]

This is because scikit-learn vectorizers have the default setting token_pattern=r'(?u)\b\w\w+\b', which only matches tokens of two or more word characters. The model is just breaking the ties arbitrarily, since neither training nor prediction sees any difference between those titles. You can get around this by using token_pattern=r'(?u)\b\w+\b':

>>> from sklearn.feature_extraction.text import CountVectorizer as CV
>>> count_vect=CV(token_pattern=r'(?u)\b\w+\b')
>>> docs_new = ['Modern Warfare 2', 'Modern Warfare 3', 'Modern Warfare 1', 'Modern Warfare 4', 'Modern Warfare A', 'Modern Warfare 44', 'Modern Warfare AA', 'Modern Warfare', 'Mahjong Kakutou Club', 'Mass Effect 2']
>>> new_counts = count_vect.fit_transform(docs_new)
>>> count_vect.inverse_transform(new_counts)
[array([u'2', u'modern', u'warfare'], 
      dtype='<U7'), array([u'3', u'modern', u'warfare'], 
      dtype='<U7'), array([u'1', u'modern', u'warfare'], 
      dtype='<U7'), array([u'4', u'modern', u'warfare'], 
      dtype='<U7'), array([u'a', u'modern', u'warfare'], 
      dtype='<U7'), array([u'44', u'modern', u'warfare'], 
      dtype='<U7'), array([u'aa', u'modern', u'warfare'], 
      dtype='<U7'), array([u'modern', u'warfare'], 
      dtype='<U7'), array([u'club', u'kakutou', u'mahjong'], 
      dtype='<U7'), array([u'2', u'effect', u'mass'], 
      dtype='<U7')]
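For completeness, here is a minimal end-to-end sketch (using a handful of the titles above as a stand-in for the real dataset, with one class per title as in the question) showing that the custom token_pattern keeps the single-character series numbers in the vocabulary, so the titles no longer collapse to identical feature vectors and the classifier can tell them apart:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

titles = ['Modern Warfare 2', 'Modern Warfare 3', 'Modern Warfare',
          'Mahjong Kakutou Club', 'Mass Effect 2']

# Default pattern \b\w\w+\b drops single-character tokens such as '2'
default_vect = CountVectorizer()
default_vect.fit(titles)

# Custom pattern \b\w+\b keeps single-character tokens
custom_vect = CountVectorizer(token_pattern=r'(?u)\b\w+\b')
custom_vect.fit(titles)

print(sorted(default_vect.vocabulary_))  # no '2' or '3' in the vocabulary
print(sorted(custom_vect.vocabulary_))   # '2' and '3' are kept

# Train with one class per title (the setup from the question) and predict
clf = MultinomialNB(alpha=1).fit(custom_vect.transform(titles), titles)
print(clf.predict(custom_vect.transform(['Modern Warfare 2'])))
```

With the default vectorizer the same prediction is an arbitrary tie-break among the "Modern Warfare" classes; with the custom pattern the '2' feature distinguishes them.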