Consider a multilabel classification problem with labels A, B, and C. I can calculate precision and recall for each label as follows:
- Precision: Correct NodeX Assignments / Total NodeX Assignments
- Recall: Correct NodeX Assignments / Total NodeX True Occurrences
- F1 Measure: 2 * (Precision * Recall) / (Precision + Recall)
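To make the definitions above concrete, here is a minimal sketch in Python. The function name `per_label_metrics`, its argument names, and the example counts are hypothetical, and returning 0.0 when a denominator is zero is an assumption rather than part of the definitions.

```python
def per_label_metrics(correct, assigned, true_occurrences):
    """Return (precision, recall, f1) for a single label from raw counts."""
    # Assumption: a zero denominator yields 0.0 instead of raising an error.
    precision = correct / assigned if assigned else 0.0
    recall = correct / true_occurrences if true_occurrences else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical counts: 1 correct out of 2 assignments, 1 true occurrence.
p, r, f1 = per_label_metrics(correct=1, assigned=2, true_occurrences=1)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```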
Since I have three labels, I'd like to get a global performance measure by averaging the values of the individual nodes, as suggested here.
However, I noticed that this breaks the F1 measure invariant: the averaged F1 no longer equals the F1 computed from the averaged precision and recall. An example to clarify:
Label, Precision, Recall, F1
A, 0.5, 1.0, 0.666
B, 1.0, 1.0, 1.0
C, 0.5, 0.5, 0.5
AVG, 0.666, 0.833, 0.722
NOTE: (2 * (0.666 * 0.833) / (0.666 + 0.833)) ≈ 0.740 != 0.722
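The comparison can be reproduced with a short sketch that averages the per-label F1 values and, separately, applies the F1 formula to the averaged precision and recall. The helper `f1` and the variable names are hypothetical; the per-label precision and recall values are the ones from the table.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0


# (precision, recall) per label, taken from the table.
scores = {"A": (0.5, 1.0), "B": (1.0, 1.0), "C": (0.5, 0.5)}

n = len(scores)
avg_precision = sum(p for p, _ in scores.values()) / n
avg_recall = sum(r for _, r in scores.values()) / n

# Average of the per-label F1 values.
mean_of_f1 = sum(f1(p, r) for p, r in scores.values()) / n
# F1 formula applied to the averaged precision and recall.
f1_of_means = f1(avg_precision, avg_recall)

print(f"average of per-label F1:     {mean_of_f1:.3f}")
print(f"F1 of averaged P and R:      {f1_of_means:.3f}")
```

The two printed values differ, which is the mismatch described above.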
Is it correct to try to come up with global, node-based metrics for measuring multilabel classification performance? Is there a better way of doing this?
NOTE: I am aware of other performance measures (accuracy, ROC/AUC, etc.), but I'd like to sort this out as well.