A few questions on stackoverflow mention this problem, but I haven't found a concrete solution.
I have a square matrix which consists of cosine similarities (values between 0 and 1), for example:
| A | B | C | D
A | 1.0 | 0.1 | 0.6 | 0.4
B | 0.1 | 1.0 | 0.1 | 0.2
C | 0.6 | 0.1 | 1.0 | 0.7
D | 0.4 | 0.2 | 0.7 | 1.0
The square matrix can be of any size. I want to get clusters (I don't know how many) which maximize the values between the elements in the cluster. I.e. for the above example I should get two clusters:
- B
- A, C, D
The reason being because C & D have the highest value between them, and A & C also have the highest value between them.
An item can be in only one cluster.
Recall is not that important for this problem, but precision is very important. It is acceptable to output three clusters: 1) B, 2) A, 3) C, D . But it is not acceptable to output any solution where B is in a cluster with another element.
I think the diagonal (1.0) is confusing me. My data is guaranteed to have at least one cluster of 2+ elements, and I want to find as many clusters as possible without sacrificing precision.
I will have to implement this in Python.