
Let's talk about a multilabel classification problem with labels A, B, and C. I can calculate the precision and recall for each label as follows:

  • Precision: correct assignments of label X / total assignments of label X
  • Recall: correct assignments of label X / total true occurrences of label X
  • F1 Measure: 2 * (Precision * Recall) / (Precision + Recall)
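The per-label definitions above can be sketched directly from counts. This is a minimal illustration, not code from the question; the count names (`correct`, `assigned`, `occurrences`) are hypothetical:

```python
# Sketch: per-label precision, recall, and F1 from hypothetical counts.
# correct     = true positives for the label
# assigned    = total times the label was predicted
# occurrences = total times the label actually occurs

def label_metrics(correct, assigned, occurrences):
    precision = correct / assigned if assigned else 0.0
    recall = correct / occurrences if occurrences else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Counts chosen to reproduce label A's row below (P = 0.5, R = 1.0):
print(label_metrics(correct=2, assigned=4, occurrences=2))
```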

Since I have 3 labels, I'd like to get a global performance measure by averaging the values of each single label, as suggested here.

However, I noticed that this breaks the F1 measure invariant! An example to clarify:

Label, Precision, Recall, F1
A,     0.5,       1.0,    0.667
B,     1.0,       1.0,    1.0
C,     0.5,       0.5,    0.5
AVG,   0.667,     0.833,  0.722

NOTE: 2 * (0.667 * 0.833) / (0.667 + 0.833) ≈ 0.741 != 0.722
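The discrepancy can be reproduced with a short sketch (the per-label values are taken from the table above; the code itself is illustrative, not from the question):

```python
# Per-label (precision, recall) pairs from the table above.
per_label = {"A": (0.5, 1.0), "B": (1.0, 1.0), "C": (0.5, 0.5)}

def f1(p, r):
    return 2 * p * r / (p + r) if p + r else 0.0

# Macro averages: mean of each column.
macro_p = sum(p for p, _ in per_label.values()) / len(per_label)
macro_r = sum(r for _, r in per_label.values()) / len(per_label)
macro_f1 = sum(f1(p, r) for p, r in per_label.values()) / len(per_label)

print(round(macro_p, 3), round(macro_r, 3))  # 0.667 0.833
print(round(macro_f1, 3))                    # 0.722 (mean of the F1 column)
print(round(f1(macro_p, macro_r), 3))        # 0.741 (F1 of the averages)
```

The two ways of aggregating disagree because F1 is the harmonic mean, which does not commute with arithmetic averaging.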

Is it correct to derive global, label-based metrics for measuring multilabel classification performance this way? Is there a better way of doing it?

NOTE: I am aware of other performance measures (Accuracy, ROC/AUC, etc) but I'd like to sort this out as well.

I think this may be more appropriate on Theoretical Computer Science. — Jim Garrison
Or perhaps Cross Validated: stats.stackexchange.com — seaotternerd

1 Answer


Averaging the F1 scores assumes that precision and recall are equally weighted, but this is often untrue in practice. Using the averaged precision and recall to calculate the F1 score makes more sense, since that will better reflect how much you favor precision versus recall. Check this article for more details.
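The weighting the answer alludes to can be made explicit with the generalized F-beta score, computed here from the averaged precision and recall. This is a sketch, not code from the answer; the averaged values are taken from the question's table:

```python
# Sketch: F-beta from averaged precision and recall.
# beta > 1 weights recall more heavily, beta < 1 weights precision more,
# and beta = 1 recovers the usual F1 score.

def f_beta(p, r, beta=1.0):
    b2 = beta * beta
    return (1 + b2) * p * r / (b2 * p + r) if p + r else 0.0

avg_p, avg_r = 2 / 3, 5 / 6  # averaged precision and recall over A, B, C

print(round(f_beta(avg_p, avg_r), 3))          # F1 of the averages
print(round(f_beta(avg_p, avg_r, beta=2), 3))  # recall-weighted F2
```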