I just installed scikit-learn 0.14 to explore the improvements to the multi-label metrics. I got some positive results with the Hamming loss metric and the classification report, but could not get the confusion matrix to work. In the classification report, I was also unable to pass the label array so that the label names are printed. The code is below. Am I doing something wrong, or is this still in development?

import numpy as np
import pandas as pd
import random

from sklearn import datasets
from sklearn.pipeline import Pipeline
from sklearn.multiclass import OneVsOneClassifier
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

target_names = ['New York','London', 'DC']

X_train = np.array(["new york is a hell of a town",
                    "new york was originally dutch",
                    "the big apple is great",
                    "new york is also called the big apple",
                    "nyc is nice",
                    "people abbreviate new york city as nyc",
                    "the capital of great britain is london",
                    "london is in the uk",
                    "london is in england",
                    "london is in great britain",
                    "it rains a lot in london",
                    "london hosts the british museum",
                    "new york is great and so is london",
                    "i like london better than new york",
                    "DC is the nations capital",
                    "DC the home of the beltway",
                    "president obama lives in Washington",
                    "The washington monument in is Washington DC"])

y_train = [[0],[0],[0],[0],[0],[0],[1],[1],[1],[1],[1],[1],[1,0],[1,0],[2],[2],[2],[2]]


X_test = np.array(['nice day in nyc',
                   'welcome to london',
                   'hello welcome to new ybrk. enjoy it here and london too',
                   'What city does the washington redskins live in?'])
y_test = [[0],[1],[0,1],[2]]

classifier = Pipeline([
                       ('vectorizer', CountVectorizer(stop_words='english',
                             ngram_range=(1,3),
                             max_df = 1.0,
                             min_df = 0.1,
                             analyzer='word')),
                       ('tfidf', TfidfTransformer()),
                       ('clf', OneVsRestClassifier(LinearSVC()))])

classifier.fit(X_train, y_train)

predicted = classifier.predict(X_test)

print predicted


for item, labels in zip(X_test, predicted):
    print '%s => %s' % (item, ', '.join(target_names[x] for x in labels))



from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.metrics import hamming_loss



hl = hamming_loss(y_test, predicted, target_names)
print " "
print " "
print "---------------------------------------------------------"
print "HAMMING LOSS"
print " "
print hl

print " "
print " "
print "---------------------------------------------------------"
print "CONFUSION MATRIX"
print " "
cm = confusion_matrix(y_test, predicted)
print cm

print " "
print " "
print "---------------------------------------------------------"
print "CLASSIFICATION REPORT"
print " "
print classification_report(y_test, predicted)

1 Answer


Multiclass and multilabel metric capabilities seem to have been improved in version 0.14, published on August 14, 2013: scikit-learn.org/stable/whats_new.html

Also, issue 558 seems to address some of this and is probably in 0.14, but I have not yet confirmed this: https://github.com/scikit-learn/scikit-learn/issues/558
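
To make it clearer what the multi-label metrics are measuring, here is a minimal sketch in plain Python that does not depend on any particular scikit-learn version. It is not scikit-learn's implementation. The helper names (`to_indicator`, `hamming_loss_manual`, `per_label_counts`) and the `y_pred` values are made up for illustration; only `target_names` and `y_test` come from the question. A plain confusion matrix assumes exactly one label per sample, which may be why it does not fit this data, so the sketch counts a separate 2x2 table per label instead.

```python
# Hand-rolled sketch in plain Python; not scikit-learn's own implementation.
target_names = ['New York', 'London', 'DC']
y_test = [[0], [1], [0, 1], [2]]   # true label sets from the question
y_pred = [[0], [1], [1], [2]]      # hypothetical predictions, invented for illustration


def to_indicator(label_lists, n_labels):
    """Convert lists of label indices into rows of 0/1 flags, one column per label."""
    return [[1 if k in labels else 0 for k in range(n_labels)]
            for labels in label_lists]


def hamming_loss_manual(true_rows, pred_rows):
    """Fraction of (sample, label) slots where truth and prediction disagree."""
    wrong = sum(t != p
                for t_row, p_row in zip(true_rows, pred_rows)
                for t, p in zip(t_row, p_row))
    return wrong / float(len(true_rows) * len(true_rows[0]))


def per_label_counts(true_rows, pred_rows, k):
    """Return (tn, fp, fn, tp) for label column k, treated as a binary problem."""
    tn = fp = fn = tp = 0
    for t_row, p_row in zip(true_rows, pred_rows):
        t, p = t_row[k], p_row[k]
        if t and p:
            tp += 1
        elif t:
            fn += 1
        elif p:
            fp += 1
        else:
            tn += 1
    return tn, fp, fn, tp


Y_true = to_indicator(y_test, len(target_names))
Y_pred = to_indicator(y_pred, len(target_names))

loss = hamming_loss_manual(Y_true, Y_pred)
counts = [per_label_counts(Y_true, Y_pred, k) for k in range(len(target_names))]

print('Hamming loss: %.4f' % loss)
for name, (tn, fp, fn, tp) in zip(target_names, counts):
    print('%-8s tn=%d fp=%d fn=%d tp=%d' % (name, tn, fp, fn, tp))
```

With these made-up predictions, one of the twelve (sample, label) slots is wrong (the missed 'New York' on the third sample), so the loss is 1/12. As for printing label names in the report: as far as I know, `classification_report` takes them through its `target_names` keyword argument, though I have not checked how that behaves with multi-label input in 0.14.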