24
votes

If you have this hierarchical clustering call in SciPy in Python:

from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform
# dist_matrix is long form distance matrix
linkage_matrix = linkage(squareform(dist_matrix), linkage_method)

then what's an efficient way to get from this to cluster assignments for individual points? That is, a vector of length N, where N is the number of points and entry i is the cluster number of point i, given the clusters produced by applying a threshold thresh to the resulting clustering.

To clarify: the cluster number is the cluster a point falls in after applying a threshold to the tree. Each leaf node then gets a unique cluster, in the sense that each point belongs to exactly one "most specific cluster", defined by the threshold at which you cut the dendrogram.

I know that scipy.cluster.hierarchy.fclusterdata returns this cluster assignment, but I am starting from a custom-made distance matrix and distance metric, so I cannot use fclusterdata. The question boils down to: how can I compute what fclusterdata computes, namely the cluster assignments?

2
If squareform(dist_matrix) returns a square matrix and you pass that to linkage(), the matrix is treated as observations and the clustering results could be incorrect. You can pass the condensed distance vector directly to linkage(). – HongboZhu
One option is to take the mean of the distance column of Z (Z[:, 2]) and cut the tree at that value. This is not a general method, but you can try it. – Gaurav Koradiya
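The mean-height heuristic from the comment above can be sketched as follows. This is only an illustration under stated assumptions: the condensed distances are made-up data, single linkage is an arbitrary choice, and the names condensed_demo, Z_demo and labels_mean are hypothetical.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Hypothetical condensed distances for 4 points, in the order
# (0,1), (0,2), (0,3), (1,2), (1,3), (2,3)
condensed_demo = np.array([1.0, 8.0, 8.5, 8.2, 8.1, 1.4])

# Column 2 of the linkage matrix holds the merge distances
Z_demo = linkage(condensed_demo, method="single")
mean_height = Z_demo[:, 2].mean()

# Cut the tree at the mean merge distance (a heuristic, not a general rule)
labels_mean = fcluster(Z_demo, t=mean_height, criterion="distance")
print(mean_height, labels_mean)
```

With this toy data the two tight pairs end up in separate clusters, but on real data the mean merge distance may cut the tree at an unhelpful level.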

2 Answers

27
votes

If I understand you right, that is what fcluster does:

scipy.cluster.hierarchy.fcluster(Z, t, criterion='inconsistent', depth=2, R=None, monocrit=None)

Forms flat clusters from the hierarchical clustering defined by the linkage matrix Z.

...

Returns: An array of length n. T[i] is the flat cluster number to which original observation i belongs.

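So just call fcluster(linkage_matrix, t), where t is your threshold. Note that with the default criterion='inconsistent', t is a threshold on the inconsistency coefficient; to cut the dendrogram at a given distance, pass criterion='distance'.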
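As a minimal sketch of the whole pipeline, assuming a square, symmetric distance matrix with a zero diagonal: the matrix values, the 'average' linkage method and the threshold below are hypothetical and chosen only to make the two groups obvious.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

# Hypothetical 5x5 distance matrix: points 0-2 are close to each other,
# points 3-4 are close to each other, and the two groups are far apart
dist_matrix = np.array([
    [0.0, 1.0, 1.5, 9.0, 9.5],
    [1.0, 0.0, 1.2, 9.2, 9.1],
    [1.5, 1.2, 0.0, 8.8, 9.3],
    [9.0, 9.2, 8.8, 0.0, 0.8],
    [9.5, 9.1, 9.3, 0.8, 0.0],
])

# linkage() expects the condensed (1-D) form of a distance matrix
condensed = squareform(dist_matrix)
linkage_matrix = linkage(condensed, method="average")

# Cut the dendrogram at a distance threshold; labels[i] is the flat
# cluster number of point i
thresh = 3.0
labels = fcluster(linkage_matrix, t=thresh, criterion="distance")
print(labels)
```

The result is an array of length N with one cluster number per point. The numbers themselves are arbitrary identifiers, so compare them for equality rather than relying on specific values.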

0
votes

If you'd like to see the members at every cluster level, and the order in which they are agglomerated, see https://stackoverflow.com/a/43170608/5728789