9
votes

I am really confused how to compute precision and recall in clustering applications.

I have the following situation:

Given two sets A and B. By using a unique key for each element I can determine which of the elements of A and B match. I want to cluster those elements based on features (not using the unique key of course).

I am doing the clustering but I am not sure how to compute precision and recall. The formulas,according to the paper "Extended Performance Graphs for Cluster Retrieval" (http://staff.science.uva.nl/~nicu/publications/CVPR01_nies.pdf) are:

p = precision = relevant retrieved items/retrieved items and r = recall = relevant retrieved items/relevant items

I really do not get what elements fall under which category.

What I did so far is, I checked within the clusters how many matching pairs I have (using the unique key). Is that already one of precision or recall? And if so, which one is it and how can I compute the other one?

Update: I just found another paper with the title "An F-Measure for Evaluation of Unsupervised Clustering with Non-Determined Number of Clusters" at http://mtg.upf.edu/files/publications/unsuperf.pdf.

7

7 Answers

8
votes

I think you'll find wikipedia has a helpful article on precision and recall. In short:

Precision = true positives / (true positives + false positives)

Recall = true positives /( true positivies + false negatives)

8
votes

There are several other measures of cluster validity that I've been using in some research I've been doing in accessing clustering methods. In cases where you have a dataset labeled with classes (supervised clustering) you can use precision and recall as mentioned above, or purity and entropy.

Purity of a cluster = the number of occurrences of the most frequent class / the size of the cluster (this should be high)

Entropy of a cluster = a measure of how dispersed classes are with a cluster (this should be low)

In cases where you don't have the class labels (unsupervised clustering), intra and inter similarity are good measures.

Intra-cluster similarity for a single cluster = average cosine similarity of all pairs within a cluster (this should be high)

Inter-cluster similarity for a single cluster = average cosine sim of all items in one cluster compared to all items in every other cluster (this should be low)

This paper has some good descriptions of all four of these measures. http://glaros.dtc.umn.edu/gkhome/fetch/papers/edcICAIL05.pdf

Nice link with the unsupervised F-measure, I'm looking into that right now.

2
votes

What I make of this problem is:

One of the sets A and B is the "positive" one. Lets suppose A is positive

Given that for an element of A in a cluster

  1. matching element of B is in the same cluster. it is a true positive
  2. matching element of B is not in the same cluster. it is a false negative
  3. non matching element of B is in the same cluster. is is a false positive
  4. non matching element of B is not in the same cluster. is is a true negative.

Then just use

Precision = true positives / (true positives + false positives)

Recall = true positives /( true positivies + false negatives) as mentioned by someone

1
votes

I think there's a problem with your definitions.

Precision and recall are suited for classification problem, which are basically two-clusters problems. Had you clustered into something like "good items" (=retrieved items) and "bad items" (=non retrieved items), then your definition would make sense.

In your case you calculated the percentage of correct clustering out of all the items, which is sort of like precision, but not really because as I said the definitions don't apply.

1
votes

See "Introduction to Information Retrieval", chapter 18 (fat clustering), for ways to evaluate clustering algorithms. http://nlp.stanford.edu/IR-book/html/htmledition/flat-clustering-1.html

This section of the book may also prove useful as it discusses metrics such as precision and recall: http://nlp.stanford.edu/IR-book/html/htmledition/evaluation-of-unranked-retrieval-sets-1.html

1
votes

The problem with precision and recall is that they generally require you to have some idea of what the 'true' labels are, whereas in many cases (and in your description) you don't know the labels, but you know the partition to compare against. I'd suggest the adjusted Rand index perhaps:

http://en.wikipedia.org/wiki/Rand_index

0
votes

If you consider one of the sets, say A, as gold clustering and the other set (B) as an output of your clustering process, (exact) precision and recall values can be estimated as:

Precision = (Number of elements common to A and B)/(Number of Elements in B)

Recall = (Number of elements common to A and B)/(Number of Elements in A)

From these standard F measure can be estimated as well.