I'm trying to use the scikit-learn module for text classification. The dataset has lots of unique words, the nature of which will be apparent from the following example:
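from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# data: list of game titles; target_names is the same list (label == content)
count_vect = CountVectorizer()  # assumed: default settings, not shown in the original post
train_counts = count_vect.fit_transform(data)
train_counts.shape
print(len(range(len(data) - 1)))
# every training sample gets its own class label
clf = MultinomialNB(alpha=1).fit(train_counts, range(len(data)))
docs_new = ['Modern Warfare 2', 'Modern Warfare 3', 'Modern Warfare 1', 'Modern Warfare 4', 'Modern Warfare', 'Mahjong Kakutou Club', 'Mass Effect 2']
new_counts = count_vect.transform(docs_new)
predicted = clf.predict(new_counts)
for doc, category in zip(docs_new, predicted):
    print('%r => %s' % (doc, target_names[category]))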
And the output looks like this:
763
'Modern Warfare 2' => Call of Duty: Modern Warfare 3
'Modern Warfare 3' => Call of Duty: Modern Warfare 3
'Modern Warfare 1' => Call of Duty: Modern Warfare 3
'Modern Warfare 4' => Call of Duty: Modern Warfare 3
'Modern Warfare' => Call of Duty: Modern Warfare 3
'Mahjong Kakutou Club' => Mahjong Kakutou Club
'Mass Effect 2' => Mass Effect 2
This is a multinomial example, but I get identical results with a Bernoulli example. I have tried alpha values from 0 to 1000000. Can anyone explain to me why this is the case?
EDIT: I should have made it clear that the following classes exist: Call of Duty: Modern Warfare, Call of Duty: Modern Warfare 2... and most other games. The list of all PlayStation games was taken from Wikipedia.
Also, using a full title as the test string, e.g. Call of Duty: Modern Warfare 2, produces the same result.
I was originally using the NLTK classifier, but for some reason it didn't place much value on words like "Kakutou" which were not present in any other examples (the scikit-learn one obviously does). It didn't have the problem with the numbers that the scikit-learn classifier does.
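One possible contributor to the number problem, offered as a guess rather than a confirmed diagnosis: if count_vect is a CountVectorizer with default settings (the post does not show how it was built), its default token_pattern only keeps tokens of two or more characters, so a trailing "2" or "3" never becomes a feature. The sketch below uses two titles from the list to show the effect; the variable names and the alternative token_pattern are illustrative assumptions, not part of the original code.

```python
from sklearn.feature_extraction.text import CountVectorizer

titles = [
    "Call of Duty: Modern Warfare 2",
    "Call of Duty: Modern Warfare 3",
]

# Default token_pattern is r"(?u)\b\w\w+\b": a token needs at least two characters.
default_vect = CountVectorizer()
default_counts = default_vect.fit_transform(titles).toarray()
default_vocab = sorted(default_vect.vocabulary_)

# Assumed alternative: a pattern that also keeps single-character tokens.
digit_vect = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
digit_counts = digit_vect.fit_transform(titles).toarray()
digit_vocab = sorted(digit_vect.vocabulary_)

# With the default pattern both titles collapse to the same count vector.
rows_identical = bool((default_counts[0] == default_counts[1]).all())

print(default_vocab)
print(digit_vocab)
print(rows_identical)
```

If the two vocabularies differ only by the digit tokens, the classifier trained on the default features has no way to tell the numbered sequels apart.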
Any guidance or information here would be immensely valuable.
Thanks
Edit: the data set is from http://en.wikipedia.org/wiki/List_of_PlayStation_3_games. It's the first column; each example has a label and content that are the same.
clf = MultinomialNB(alpha=1).fit(train_counts, range(len(data)) )
means that you have one class per sample. I don't see how you would expect anything to be able to generalize by seeing only one example per class and having as many distinct classes as examples. – ogrisel
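Following on from that comment: with one example per class, the task looks more like finding the closest known title than like classification. A minimal sketch of that different technique (nearest title by cosine similarity over count vectors, not Naive Bayes) is below. The short titles list is a hypothetical stand-in for the full Wikipedia column, closest_title is a made-up helper name, and the token_pattern that keeps single-character tokens is an assumption about what the asker wants.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical stand-in for the full list of game titles.
titles = [
    "Call of Duty: Modern Warfare 2",
    "Call of Duty: Modern Warfare 3",
    "Mahjong Kakutou Club",
    "Mass Effect 2",
]

# Keep single-character tokens so "2" and "3" count as features (assumed choice).
vect = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
title_vectors = vect.fit_transform(titles)


def closest_title(query):
    """Return the known title whose count vector is most similar to the query."""
    sims = cosine_similarity(vect.transform([query]), title_vectors)
    return titles[sims.argmax()]


for query in ["Modern Warfare 2", "Modern Warfare 3", "Mass Effect 2"]:
    print("%r => %s" % (query, closest_title(query)))
```

This avoids fitting a model with as many classes as samples; whether it scales or ranks well on the full list is untested here.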