I have a data set with multiple discrete labels, say 4, 5 and 6. On this I run ExtraTreesClassifier (I will also run a multinomial logit afterward on the same data; this is just a short example), as below:
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.metrics import roc_curve, auc
clf = ExtraTreesClassifier(n_estimators=200, random_state=0, criterion='gini',
                           bootstrap=True, oob_score=True)  # oob_score requires bootstrap=True
# compute_importances was removed from scikit-learn; use clf.feature_importances_ instead
# Also tried criterion='entropy' for the information gain
clf.fit(x_train, y_train)
# y_test holds the true test labels; y_predicted holds the labels predicted by the fitted classifier
y_predicted = clf.predict(x_test)
fpr, tpr, thresholds = roc_curve(y_test, y_predicted, pos_label=4)  # recall my labels are 4, 5 and 6
roc_auc = auc(fpr, tpr)
print("Area under the ROC curve : %f" % roc_auc)
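To make this concrete, here is a self-contained sketch of the per-class version I have in mind, using predict_proba scores instead of the predicted labels; synthetic data stands in for my real x_train/x_test, and the data-generation parameters are arbitrary:

```python
# Sketch: per-class ROC using predicted probabilities.
# Synthetic stand-in data; labels shifted to 4, 5, 6 to mirror the question.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.metrics import roc_curve, auc
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, n_classes=3, n_informative=6,
                           random_state=0)
y = y + 4  # relabel classes as 4, 5, 6
x_train, x_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = ExtraTreesClassifier(n_estimators=200, random_state=0)
clf.fit(x_train, y_train)

# Column k of predict_proba corresponds to clf.classes_[k]
y_score = clf.predict_proba(x_test)
for k, label in enumerate(clf.classes_):
    fpr, tpr, thresholds = roc_curve(y_test, y_score[:, k], pos_label=label)
    print("class %d: AUC = %.3f" % (label, auc(fpr, tpr)))
```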
The question is: is there something like an average ROC curve? Basically, I could add up all the tpr and fpr values separately for EACH label value and then take the means (would that make sense, by the way?), and then just call
# Would this be statistically correct, and would mean something worth interpreting?
roc_auc_average = auc(fpr_average, tpr_average)
print("Area under the ROC curve : %f" % roc_auc_average)
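Here is a self-contained sketch of the averaging I have in mind: interpolate each class's tpr onto a common fpr grid and average the curves. Synthetic data stands in for my real set, and the grid size of 101 points is an arbitrary choice:

```python
# Sketch: macro-averaged ROC by interpolating per-class TPR onto a common
# FPR grid. Synthetic stand-in data; labels shifted to 4, 5, 6.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.metrics import roc_curve, auc
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, n_classes=3, n_informative=6,
                           random_state=0)
y = y + 4
x_train, x_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = ExtraTreesClassifier(n_estimators=200, random_state=0).fit(x_train, y_train)
y_score = clf.predict_proba(x_test)

fpr_grid = np.linspace(0.0, 1.0, 101)   # common FPR grid for interpolation
mean_tpr = np.zeros_like(fpr_grid)
for k, label in enumerate(clf.classes_):
    fpr, tpr, _ = roc_curve(y_test, y_score[:, k], pos_label=label)
    mean_tpr += np.interp(fpr_grid, fpr, tpr)  # TPR at each grid point
mean_tpr /= len(clf.classes_)
mean_tpr[0] = 0.0  # the curve starts at (0, 0) by convention

roc_auc_average = auc(fpr_grid, mean_tpr)
print("Macro-average area under the ROC curve : %f" % roc_auc_average)
```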
I am assuming I will get something similar to this afterward, but how do I interpret the thresholds in this case? (See: How to plot a ROC curve for a knn model.)
Hence, please also mention whether I can/should get individual thresholds in this case, and why one approach would be better (statistically) than the other.
What I tried so far (besides averaging):
On changing pos_label to 4, then 5, then 6, and plotting the ROC curves, I see very poor performance, even worse than the y = x line (the perfectly random case, where tpr = fpr). How should I approach this problem?
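To make the symptom reproducible, here is a sketch on synthetic stand-in data comparing a curve built from predict() with one built from predict_proba(). Possibly this is the source of the poor curves: roc_curve treats its second argument as a score, so passing the predicted labels 4/5/6 ranks "predicted 4" lowest, and with pos_label=4 the true positives end up at the bottom of the ranking:

```python
# Sketch: ROC AUC from hard predicted labels vs. probability scores.
# Synthetic stand-in data; labels shifted to 4, 5, 6 as in the question.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.metrics import roc_curve, auc
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, n_classes=3, n_informative=6,
                           random_state=0)
y = y + 4
x_train, x_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = ExtraTreesClassifier(n_estimators=200, random_state=0).fit(x_train, y_train)

# Hard labels used as "scores": only a few distinct thresholds exist,
# and the numeric ordering 4 < 5 < 6 scrambles the ranking.
fpr_h, tpr_h, _ = roc_curve(y_test, clf.predict(x_test), pos_label=4)
auc_hard = auc(fpr_h, tpr_h)

# Probability of class 4 used as the score: a proper ranking.
p4 = clf.predict_proba(x_test)[:, list(clf.classes_).index(4)]
fpr_s, tpr_s, _ = roc_curve(y_test, p4, pos_label=4)
auc_soft = auc(fpr_s, tpr_s)
print("AUC from hard labels: %.3f, from probabilities: %.3f" % (auc_hard, auc_soft))
```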