
Let's assume that we have a classification problem with 3 classes and highly imbalanced data: class 1 has 185 data points, class 2 has 199, and class 3 has 720.

For calculating the AUC on a multiclass problem there is the macro-average method (giving equal weight to the classification of each label) and the micro-average method (considering each element of the label indicator matrix as a binary prediction), as described in the scikit-learn tutorial.
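For reference, here is a minimal sketch of how both averages can be computed with scikit-learn, following its ROC tutorial (the labels and scores below are hypothetical stand-ins for my real data; `y_score` would be the model's predicted class probabilities):

```python
import numpy as np
from sklearn.preprocessing import label_binarize
from sklearn.metrics import roc_auc_score

y_true = np.array([1, 2, 3, 3, 3])            # hypothetical class labels
y_score = np.array([[0.7, 0.2, 0.1],           # hypothetical per-class
                    [0.3, 0.5, 0.2],           # predicted probabilities,
                    [0.1, 0.2, 0.7],           # shape (n_samples, 3)
                    [0.2, 0.3, 0.5],
                    [0.1, 0.1, 0.8]])

# Binarize labels into an indicator matrix so each column can be treated
# as a binary problem, as in the scikit-learn ROC tutorial.
y_bin = label_binarize(y_true, classes=[1, 2, 3])

macro_auc = roc_auc_score(y_bin, y_score, average='macro')  # equal weight per class
micro_auc = roc_auc_score(y_bin, y_score, average='micro')  # equal weight per point
```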

For such an imbalanced dataset, should micro-averaging or macro-averaging of the AUC be used?

I'm unsure because, for the confusion matrix shown below, I get a micro-averaged AUC of 0.76 but a macro-averaged AUC of only 0.55.

[confusion matrix image]

I'm voting to close this question as off-topic because it is not about programming. - desertnaut
Micro-average should be the recommended one for an imbalanced dataset, but there seems to be some inconsistency between the example data you provided and the confusion matrix, e.g., for class 1, the number of data points (first row) in the confusion matrix does not sum to 200; likewise for classes 2 and 3. - Sandipan Dey
@SandipanDey Thank you very much for your answer. I have updated the question regarding the number of data points. But why do I get a so much higher value for micro-averaging than for macro-averaging on this confusion matrix? - BlackHawk
I’m voting to close this question because it belongs to datascience.stackexchange.com - jopasserat

1 Answer


Since the class with the majority of the data points is classified far more accurately than the other two, a score computed with the micro-average (which weights every data point equally) will be higher than the same score computed with the macro-average (which weights every class equally). Take the per-class recall, the fraction of each class's points that is classified correctly:

Here, R1 = 12/185 ≈ 0.0649, R2 = 11/199 ≈ 0.0553, R3 = 670/720 ≈ 0.9306.

The macro-averaged recall = (R1 + R2 + R3) / 3 ≈ 0.3502, which is much less than the micro-averaged recall = (12 + 11 + 670) / (185 + 199 + 720) ≈ 0.6277.
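The same arithmetic as a short Python sketch, using only the diagonal counts and the per-class totals quoted above:

```python
# Diagonal (correctly classified) counts and actual per-class totals
# taken from the confusion matrix in the question.
correct = [12, 11, 670]
totals = [185, 199, 720]

# Macro-average: mean of the per-class scores, so each class counts equally.
per_class = [c / t for c, t in zip(correct, totals)]
macro = sum(per_class) / len(per_class)   # ~0.3502

# Micro-average: pool all points first, so the majority class dominates.
micro = sum(correct) / sum(totals)        # ~0.6277

print(macro, micro)
```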

The same holds true for the AUC: the micro-average pools all the data points, so the well-classified majority class dominates it, while the macro-average gives each class equal weight.