I am having difficulty understanding how to measure precision and recall for multi-class clustering. Here is an example with 9 elements:

Consider the following ground truth:

A,B,C,D
E,F,G
H,I

and the following observed clustering:

A,B,C
D
E,F,G,H,I

How should I calculate the number of true positives (TP), false positives (FP) and false negatives (FN)?

My naive approach has been to consider all pairs of elements:

TP = 7 (A-B, A-C, B-C, E-F, E-G, F-G, H-I)
FP = 6 (E-H, E-I, F-H, F-I, G-H, G-I)
FN = 3 (A-D, B-D, C-D)
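
The pair counting above can be sketched in Python as a sanity check. This is only a brute-force illustration; the dictionaries `truth` and `observed` and the function name `pair_counts` are made up for this example, and the cluster ids are arbitrary integers.

```python
from itertools import combinations

# Cluster id per element (the ids themselves are arbitrary, hypothetical labels).
truth = {"A": 0, "B": 0, "C": 0, "D": 0, "E": 1, "F": 1, "G": 1, "H": 2, "I": 2}
observed = {"A": 0, "B": 0, "C": 0, "D": 1, "E": 2, "F": 2, "G": 2, "H": 2, "I": 2}

def pair_counts(truth, observed):
    """Classify every unordered pair of elements as TP, FP, FN or TN."""
    tp = fp = fn = tn = 0
    for x, y in combinations(sorted(truth), 2):
        same_truth = truth[x] == truth[y]
        same_observed = observed[x] == observed[y]
        if same_observed and same_truth:
            tp += 1  # together in both partitions
        elif same_observed:
            fp += 1  # together only in the observed clustering
        elif same_truth:
            fn += 1  # together only in the ground truth
        else:
            tn += 1  # separated in both partitions
    return tp, fp, fn, tn

tp, fp, fn, tn = pair_counts(truth, observed)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
print(tp, fp, fn, tn)  # 7 6 3 20
```

With these counts, pairwise precision is 7/13 and pairwise recall is 7/10.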

Is this the right way to do it?

Thanks

1 Answer

Yes, TP etc. look good to me at first sight.

But enumerating all pairs is slow.

You can do better: you can directly compute the number of pairs from a cross tabulation matrix.

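The nonzero cells of the cross tabulation have sizes 3, 1, 3 and 2, so TP = 3*2/2 + 1*0/2 + 3*2/2 + 2*1/2 = 7

The observed clusters have sizes 3, 1 and 5, so FP = 3*2/2 + 1*0/2 + 5*4/2 - TP = 13 - 7 = 6

The ground truth clusters have sizes 4, 3 and 2, so FN = 4*3/2 + 3*2/2 + 2*1/2 - TP = 10 - 7 = 3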

etc.
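
A minimal sketch of the cross tabulation approach, assuming the clusterings are given as two parallel label lists (the names `truth_labels`, `observed_labels` and `pair_counts_from_crosstab` are illustrative, not from any library). FP is derived from the observed cluster sizes and FN from the ground truth cluster sizes, which reproduces the enumeration in the question (FP = 6, FN = 3).

```python
from collections import Counter
from math import comb  # requires Python 3.8+

# Labels for elements A..I, in order (cluster ids are arbitrary).
truth_labels = [0, 0, 0, 0, 1, 1, 1, 2, 2]
observed_labels = [0, 0, 0, 1, 2, 2, 2, 2, 2]

def pair_counts_from_crosstab(truth, observed):
    """Pair counts computed from cluster sizes, without enumerating pairs."""
    n = len(truth)
    cells = Counter(zip(truth, observed))  # cross tabulation cells
    truth_sizes = Counter(truth)           # row sums
    observed_sizes = Counter(observed)     # column sums

    tp = sum(comb(c, 2) for c in cells.values())
    fp = sum(comb(c, 2) for c in observed_sizes.values()) - tp
    fn = sum(comb(c, 2) for c in truth_sizes.values()) - tp
    tn = comb(n, 2) - tp - fp - fn
    return tp, fp, fn, tn

print(pair_counts_from_crosstab(truth_labels, observed_labels))  # (7, 6, 3, 20)
```

This needs one pass over the data plus a sum over the nonzero cells, instead of a loop over all n*(n-1)/2 pairs.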

That said, I would rather compute the Adjusted Rand Index (ARI), because you want a measure where a random result scores close to 0. With pairwise precision and recall, results tend to look much better than they are.
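
For illustration, the ARI can be computed from the same pair sums using the standard formula (index minus expected index, divided by maximum index minus expected index). The sketch below is pure Python with a made-up function name `adjusted_rand_index`; returning 1.0 in the degenerate case is a convention assumed here, not something stated above. If scikit-learn is available, `sklearn.metrics.adjusted_rand_score` computes the same quantity.

```python
from collections import Counter
from math import comb  # requires Python 3.8+

def adjusted_rand_index(truth, observed):
    """ARI from two parallel label lists, via the cross tabulation."""
    n = len(truth)
    total_pairs = comb(n, 2)
    if total_pairs == 0:
        return 1.0  # assumed convention: fewer than two elements
    sum_cells = sum(comb(c, 2) for c in Counter(zip(truth, observed)).values())
    sum_truth = sum(comb(c, 2) for c in Counter(truth).values())
    sum_observed = sum(comb(c, 2) for c in Counter(observed).values())

    expected = sum_truth * sum_observed / total_pairs  # chance-level agreement
    max_index = (sum_truth + sum_observed) / 2
    if max_index == expected:
        return 1.0  # assumed convention: both partitions are trivial
    return (sum_cells - expected) / (max_index - expected)

# Labels for elements A..I from the question.
truth_labels = [0, 0, 0, 0, 1, 1, 1, 2, 2]
observed_labels = [0, 0, 0, 1, 2, 2, 2, 2, 2]
ari = adjusted_rand_index(truth_labels, observed_labels)
print(round(ari, 4))  # 0.4296
```

For the example in the question this gives 61/142, roughly 0.43, noticeably lower than the pairwise recall of 0.7.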