I am working with data to classify handwritten numbers from 0 to 9. I am using PCA to reduce the dimensionality to 6 principal components and KNN to model the data.
When I created the confusion matrix, the results looked reasonable. It wasn't perfect, and I wasn't expecting it to be, but it made sense given the accuracy of ~0.8885 for my k-value.
array([[ 952,    0,    2,    1,    0,    9,    9,    0,    7,    0],
       [   0, 1125,    0,    3,    0,    0,    5,    1,    1,    0],
       [   7,    5,  973,   11,    4,    2,    9,    3,   18,    0],
       [   4,    9,   15,  846,    2,   40,    2,    7,   82,    3],
       [   3,    4,    9,    6,  830,    5,   16,   11,    0,   98],
       [  23,    1,    9,   38,    9,  787,    9,    2,   10,    4],
       [  17,    8,   16,    2,   13,    9,  893,    0,    0,    0],
       [   2,   14,   13,    3,   54,    4,    0,  909,    6,   23],
       [  16,    2,   25,   60,   23,   23,    4,    6,  802,   13],
       [  11,    5,    7,   16,  155,   15,    4,   21,    7,  768]],
      dtype=int64)
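For context, this is roughly how I produce the projected data used below and the confusion matrix above (a minimal sketch; train_images, test_images, and the label arrays are placeholders for my actual variables, and predicted_class comes from the KNN loop further down):

import numpy as np
from sklearn.decomposition import PCA
from sklearn.metrics import confusion_matrix

# Flatten the digit images and reduce them to 6 principal components
pca = PCA(n_components=6)
train_projected = pca.fit_transform(train_images.reshape(len(train_images), -1))
test_projected = pca.transform(test_images.reshape(len(test_images), -1))

# 10x10 confusion matrix shown above
print(confusion_matrix(y_test, np.ravel(predicted_class)))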
However, when I try to plot the ROC curve, I only get 3 points out of fpr and tpr, and the curve seems abnormally high. I was sure I needed more points, so I tried changing my approach to computing roc_curve, but now I get obscenely low results that don't make sense given my confusion matrix. It also seems like the ROC curves just increase in accuracy as I go down the list of classes.
I was wondering what I could be doing wrong in my ROC computation.
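As a toy illustration of the "too few points" symptom (made-up scores, not my data): roc_curve returns one point per distinct threshold, so passing hard 0/1 predictions gives very short fpr/tpr arrays, while continuous scores give a proper curve:

import numpy as np
from sklearn import metrics

y_true = np.array([0, 0, 1, 1])

# Hard 0/1 predictions: only two distinct values, so only 3 points come back
fpr, tpr, thr = metrics.roc_curve(y_true, np.array([0, 1, 1, 1]))
print(len(fpr))  # 3

# Continuous scores: one point per distinct threshold
fpr, tpr, thr = metrics.roc_curve(y_true, np.array([0.1, 0.4, 0.35, 0.8]))
print(len(fpr))  # 5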
import numpy as np
import matplotlib.pyplot as plt
from sklearn import metrics
from sklearn.neighbors import KNeighborsClassifier

accuracy = 0
predicted_class = np.zeros((np.size(y_test), 1))

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(test_projected, y_test)

# Count correct predictions one query point at a time
for i in range(0, np.size(test_projected[:, 0])):
    query_point = test_projected[i, :]
    true_class_of_query_point = y_test[i]
    predicted_class[i] = knn.predict([query_point])
    if predicted_class[i] == true_class_of_query_point:
        accuracy += 1

print('Accuracy of k = 3 is ', accuracy / np.size(test_projected[:, 0]), '\n')
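(As an aside, if I understand the sklearn API correctly, knn.score should give the same number without the per-point loop:)

# Mean accuracy over the test set in one call
print('Accuracy of k = 3 is', knn.score(test_projected, y_test))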
fig, axs = plt.subplots(5, 2, figsize=(15, 15))
fig.tight_layout()
j = 0
k = 0
y_gnd = np.zeros((10000, 1))
for i in range(0, 10):
    # One-vs-rest ground truth for class i
    for m in range(0, 10000):
        if y_test[m] == i:
            y_gnd[m] = 1
        else:
            y_gnd[m] = 0
    fpr, tpr, threshold = metrics.roc_curve(y_gnd, predicted_class)
    auc = metrics.roc_auc_score(y_gnd, predicted_class)
    axs[j][k].plot(fpr, tpr)
    axs[j][k].set_title('AUC Score for ' + str(i) + ' is = ' + str(auc) + '.')
    # Move to the next subplot in the 5x2 grid
    if k == 1:
        j += 1
    k += 1
    if k > 1:
        k = 0

Also, are the inputs to roc_auc_score supposed to be fpr and tpr? I have seen both the labels and predictions used as inputs, as well as fpr and tpr.
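For what it's worth, from the docs it looks like metrics.auc is the function that takes fpr and tpr, while roc_auc_score takes the true labels and the scores directly (here scores stands in for the per-class probabilities):

# Two ways to get the same AUC, as far as I can tell
fpr, tpr, threshold = metrics.roc_curve(y_gnd, scores)  # scores = per-class probabilities
auc_from_points = metrics.auc(fpr, tpr)                 # takes fpr and tpr
auc_from_scores = metrics.roc_auc_score(y_gnd, scores)  # takes labels and scores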
Edit: New ROC curves using predict_proba instead of the predicted class labels
pred = knn.predict_proba(test_projected)
fpr, tpr, threshold = metrics.roc_curve(y_gnd, pred[:, i])
auc = metrics.roc_auc_score(y_gnd, pred[:, i])
Comment from sim: Use predict_proba rather than predict to get the class probabilities, which will then be used by both roc_curve and roc_auc_score. Your plots (I believe) consider the predicted class labels as the non-thresholded prediction scores (which they are not).
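Putting that suggestion together with my grid of subplots, the corrected per-class loop would look roughly like this (a sketch; column i of pred is the probability of class i, and I flatten y_test in case it is a column vector):

pred = knn.predict_proba(test_projected)

fig, axs = plt.subplots(5, 2, figsize=(15, 15))
fig.tight_layout()
for i in range(0, 10):
    y_gnd = (np.ravel(y_test) == i).astype(int)  # one-vs-rest ground truth for class i
    fpr, tpr, threshold = metrics.roc_curve(y_gnd, pred[:, i])
    auc = metrics.roc_auc_score(y_gnd, pred[:, i])
    axs[i // 2][i % 2].plot(fpr, tpr)
    axs[i // 2][i % 2].set_title('AUC Score for ' + str(i) + ' is = ' + str(auc) + '.')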