
Let's talk about a multilabel classification problem with labels A, B, and C. I can calculate the precision and recall for each label as follows:

  • Precision: correct assignments of label X / total assignments of label X
  • Recall: correct assignments of label X / total true occurrences of label X
  • F1 Measure: 2 * (Precision * Recall) / (Precision + Recall)
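The per-label definitions above can be sketched directly from counts. This is a minimal illustration, not code from the question; the count names (`correct`, `assigned`, `occurrences`) are hypothetical:

```python
# Sketch: per-label precision, recall, and F1 from hypothetical counts.
# correct     = true positives for the label
# assigned    = total times the label was predicted
# occurrences = total times the label actually occurs

def label_metrics(correct, assigned, occurrences):
    precision = correct / assigned if assigned else 0.0
    recall = correct / occurrences if occurrences else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Counts chosen to reproduce label A's row below (P = 0.5, R = 1.0):
print(label_metrics(correct=2, assigned=4, occurrences=2))
```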

Since I have 3 labels, I'd like to get a global performance measure by averaging the values of each single label, as suggested here.

However, I noticed that this breaks the F1 measure invariant! An example to clarify:

Label, Precision, Recall, F1
A,     0.5,       1.0,    0.667
B,     1.0,       1.0,    1.0
C,     0.5,       0.5,    0.5
AVG,   0.667,     0.833,  0.722

NOTE: 2 * (0.667 * 0.833) / (0.667 + 0.833) ≈ 0.741 != 0.722
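The discrepancy can be reproduced with a short sketch (the per-label values are taken from the table above; the code itself is illustrative, not from the question):

```python
# Per-label (precision, recall) pairs from the table above.
per_label = {"A": (0.5, 1.0), "B": (1.0, 1.0), "C": (0.5, 0.5)}

def f1(p, r):
    return 2 * p * r / (p + r) if p + r else 0.0

# Macro averages: mean of each column.
macro_p = sum(p for p, _ in per_label.values()) / len(per_label)
macro_r = sum(r for _, r in per_label.values()) / len(per_label)
macro_f1 = sum(f1(p, r) for p, r in per_label.values()) / len(per_label)

print(round(macro_p, 3), round(macro_r, 3))  # 0.667 0.833
print(round(macro_f1, 3))                    # 0.722 (mean of the F1 column)
print(round(f1(macro_p, macro_r), 3))        # 0.741 (F1 of the averages)
```

The two ways of aggregating disagree because F1 is the harmonic mean, which does not commute with arithmetic averaging.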

Is it correct to derive global, label-based metrics for measuring multilabel classification performance this way? Is there a better way of doing it?

NOTE: I am aware of other performance measures (Accuracy, ROC/AUC, etc) but I'd like to sort this out as well.

I think this may be more appropriate on Theoretical Computer Science. — Jim Garrison
Or perhaps Cross Validated: stats.stackexchange.com — seaotternerd

1 Answer


Averaging the F1 scores assumes that precision and recall are equally weighted, but this is often untrue in practice. Using the averaged precision and recall to calculate the F1 score makes more sense, since that will better reflect how much you favor precision versus recall. Check this article for more details.
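The weighting the answer alludes to can be made explicit with the generalized F-beta score, computed here from the averaged precision and recall. This is a sketch, not code from the answer; the averaged values are taken from the question's table:

```python
# Sketch: F-beta from averaged precision and recall.
# beta > 1 weights recall more heavily, beta < 1 weights precision more,
# and beta = 1 recovers the usual F1 score.

def f_beta(p, r, beta=1.0):
    b2 = beta * beta
    return (1 + b2) * p * r / (b2 * p + r) if p + r else 0.0

avg_p, avg_r = 2 / 3, 5 / 6  # averaged precision and recall over A, B, C

print(round(f_beta(avg_p, avg_r), 3))          # F1 of the averages
print(round(f_beta(avg_p, avg_r, beta=2), 3))  # recall-weighted F2
```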