I'm training scikit-learn's neighbors.KNeighborsClassifier model on a multi-class classification problem. I've already predicted the most likely class, but now I want to extract the second most likely class using the predict_proba function. However, the function just returns a raw numpy array, whose columns are supposed to be sorted lexicographically by class label. When I eyeball the data to check whether the probabilities are in alphabetical order, they do not seem to be.
from sklearn import neighbors
knn_classifier = neighbors.KNeighborsClassifier(n_neighbors=NUM_NEIGHBORS, weights='distance', metric='haversine')
knn_classifier.fit(knn_data, response)
unique_levels = response.unique()
unique_levels.sort()
print(unique_levels)
['Canada' 'DCarea' 'NYarea' 'bostonArea' 'caribbean' 'eastAsia' 'florida'
'hawaii' 'italy' 'midwest' 'nevada' 'newEngland' 'northernEurope'
'northern_california' 'northern_france' 'notFound' 'otherSouthernEurope'
'pacificNW' 'pennArea' 'south' 'southAmerica' 'southeastAsiaAus'
'southern_california' 'spain' 'texas' 'unitedKingdom' 'west']
knn_preds = knn_classifier.predict(knn_data)
knn_probs = knn_classifier.predict_proba(knn_data)
knn_preds[0:10]
array(['DCarea', 'NYarea', 'DCarea', 'Canada', 'midwest', 'unitedKingdom',
'midwest', 'NYarea', 'NYarea', 'south'], dtype=object)
knn_probs[0]
array([ 0. , 0.0667, 0.2667, 0.0333, 0.1 , 0. , 0. ,
0. , 0. , 0.0667, 0.1 , 0. , 0. , 0.0667,
0. , 0. , 0. , 0.0333, 0. , 0.1 , 0. ,
0. , 0.1333, 0. , 0. , 0. , 0.0333])
knn_probs[1]
array([ 0. , 0. , 0.25 , 0. , 0. , 0. , 0. , 0. ,
0. , 0.125, 0.125, 0. , 0. , 0.25 , 0. , 0. ,
0. , 0.125, 0. , 0. , 0. , 0. , 0.125, 0. ,
0. , 0. , 0. ])
If the probabilities were sorted lexicographically, I would expect the second entry in knn_probs[0] to have the highest probability, since 'DCarea' was the winning class and it comes second lexicographically (per the sorted list above). However, the largest value is the third item in the array. What gives?
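For reference, here is a minimal sketch of how the column order could be checked directly instead of being inferred from a manual sort. It relies on the fitted classifier's classes_ attribute, which lists the label belonging to each predict_proba column, and uses np.argsort to pick the second most likely class. The toy data, the default Euclidean metric, and the variable names (X, y, clf, top, second) are assumptions made for illustration; they are not the data or settings from the question.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical toy data: three classes spread along a 1-D line.
X = np.array([[0.0], [0.1], [0.2],
              [1.0], [1.1], [1.2],
              [2.0], [2.1], [2.2]])
y = np.array(['bostonArea', 'bostonArea', 'bostonArea',
              'Canada', 'Canada', 'Canada',
              'DCarea', 'DCarea', 'DCarea'])

clf = KNeighborsClassifier(n_neighbors=5).fit(X, y)

# classes_ gives the label for each predict_proba column, in column order.
# Labels are sorted by code point, so uppercase names precede lowercase ones.
classes = clf.classes_

query = np.array([[0.9]])
probs = clf.predict_proba(query)

# Column indices sorted by descending probability, one row per sample.
order = np.argsort(probs, axis=1)[:, ::-1]
top = classes[order[:, 0]]      # most likely class per sample
second = classes[order[:, 1]]   # second most likely class per sample

print(classes)
print(probs[0])
print(top[0], second[0])
```

Indexing classes_ with the argsort result avoids relying on any assumption about how the labels were ordered. Note that ties in probability are broken arbitrarily by argsort, so the "second" class is only well defined when the top two probabilities differ.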