Connecting probabilities with labels in scikit-learn

Question

I'm training scikit-learn's neighbors.KNeighborsClassifier on a multi-class classification problem. I've already predicted the most likely class, but now I want to extract the second most likely class using the predict_proba function. The function just returns a raw numpy array, which is supposed to be sorted lexicographically by class label. However, when I eyeball the data to check whether the probabilities are in alphabetical order, they do not seem to be.

from sklearn import neighbors
knn_classifier  = neighbors.KNeighborsClassifier(n_neighbors = NUM_NEIGHBORS, weights = 'distance', metric ='haversine' )
knn_classifier.fit(knn_data, response)

unique_levels =  response.unique()
unique_levels.sort()
print(unique_levels)
    ['Canada' 'DCarea' 'NYarea' 'bostonArea' 'caribbean' 'eastAsia' 'florida'
     'hawaii' 'italy' 'midwest' 'nevada' 'newEngland' 'northernEurope'
     'northern_california' 'northern_france' 'notFound' 'otherSouthernEurope'
     'pacificNW' 'pennArea' 'south' 'southAmerica' 'southeastAsiaAus'
     'southern_california' 'spain' 'texas' 'unitedKingdom' 'west']

knn_preds = knn_classifier.predict(knn_data)
knn_probs = knn_classifier.predict_proba(knn_data)

knn_preds[0:10]
    array(['DCarea', 'NYarea', 'DCarea', 'Canada', 'midwest', 'unitedKingdom',
           'midwest', 'NYarea', 'NYarea', 'south'], dtype=object)

knn_probs[0]
    array([ 0.    ,  0.0667,  0.2667,  0.0333,  0.1   ,  0.    ,  0.    ,
            0.    ,  0.    ,  0.0667,  0.1   ,  0.    ,  0.    ,  0.0667,
            0.    ,  0.    ,  0.    ,  0.0333,  0.    ,  0.1   ,  0.    ,
            0.    ,  0.1333,  0.    ,  0.    ,  0.    ,  0.0333])

knn_probs[1]
    array([ 0.   ,  0.   ,  0.25 ,  0.   ,  0.   ,  0.   ,  0.   ,  0.   ,
            0.   ,  0.125,  0.125,  0.   ,  0.   ,  0.25 ,  0.   ,  0.   ,
            0.   ,  0.125,  0.   ,  0.   ,  0.   ,  0.   ,  0.125,  0.   ,
            0.   ,  0.   ,  0.   ])

If the probabilities were sorted lexicographically, I would expect the second key in knn_probs[0] to have the highest probability, since 'DCarea' was the winning class, and it comes second lexicographically (per above). However, the largest value is the third item in the list. What gives?

Did you ever figure this out? I presume you are using Pandas; have you tried bypassing Pandas and using only numpy arrays and Python lists? — Andreus

Unknown Unknown · Accepted Answer · 2015-07-28T18:01:08

I believe the probabilities follow the order of the labels stored in knn_classifier.classes_. You can zip classes_ with a row of predict_proba output, sort the pairs by probability in descending order, and take the second one.

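import numpy as np

classes_ = np.array(['a', 'b', 'c'])
prob_vec = np.array([0.6, 0, 0.4])
# Sort (label, probability) pairs by descending probability; index 1 is the runner-up.
sec_class, sec_prob = sorted(zip(classes_, prob_vec), key=lambda k: -k[1])[1]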
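
To get the runner-up for every row at once rather than one vector at a time, something along these lines might work. This is only a sketch: the classes and probs arrays are made-up stand-ins for knn_classifier.classes_ and the output of knn_classifier.predict_proba, and it assumes the columns of the probability matrix line up with classes_ as described above.

```python
import numpy as np

# Hypothetical stand-ins for knn_classifier.classes_ and the matrix returned
# by knn_classifier.predict_proba(...); the values are invented for illustration.
classes = np.array(['Canada', 'DCarea', 'NYarea', 'bostonArea'])
probs = np.array([
    [0.10, 0.60, 0.30, 0.00],
    [0.50, 0.05, 0.05, 0.40],
])

# For each row, column indices ordered from most to least probable.
# A stable sort keeps tied classes in their classes_ order.
order = np.argsort(-probs, axis=1, kind='stable')

# Column 0 of `order` is the winning class, column 1 is the runner-up.
top_class = classes[order[:, 0]]
second_class = classes[order[:, 1]]
second_prob = probs[np.arange(len(probs)), order[:, 1]]

print(top_class)
print(second_class)
print(second_prob)
```

Comparing top_class against the output of predict is also a quick way to check whether the column order really matches classes_ on your data.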
