0
votes

I use Scipy library to perform hierarchical clustering and create the dendrogram. Here is the simple code and the generated dendrogram:

import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage
from matplotlib import pyplot as plt

X = np.array([[5, 3],
              [10, 15],
              [15, 12],
              [24, 10],
              [30, 30],
              [85, 70],
              [71, 80],
              [60, 78],
              [70, 55],
              [80, 91]])
linkage_matrix = linkage(X, "single")
_ = dendrogram(linkage_matrix,)

enter image description here

I need to print all the clusters and samples that belong to each cluster at each step of the clustering process. Here is the desired output for the abovementioned data and dendrogram:

[{0}, {1}, {2}, {3}, {4}, {5}, {6}, {7}, {8}, {9}]
[{0}, {1, 2}, {3}, {4}, {5}, {6}, {7}, {8}, {9}]
[{0}, {1, 2, 3}, {4}, {5}, {6}, {7}, {8}, {9}]
[{0}, {1, 2, 3}, {4}, {5}, {6, 7}, {8}, {9}]
[{0, 1, 2, 3}, {4}, {5}, {6, 7}, {8}, {9}]
[{0, 1, 2, 3}, {4}, {5}, {6, 7, 9}, {8}]
[{0, 1, 2, 3}, {4}, {5, 6, 7, 9}, {8}]
[{0, 1, 2, 3, 4}, {5, 6, 7, 9}, {8}]
[{0, 1, 2, 3, 4}, {5, 6, 7, 9, 8}]
[{0, 1, 2, 3, 4, 5, 6, 7, 9, 8}]

Please note that it is also okay if there is a solution using the Scikit-Learn agglomerative clustering.

1

1 Answers

3
votes

Use cut_tree function from the same module, and specify number of clusters as cut condition. Unfortunately, it wont cut in the case where each element is its own cluster, but that case is trivial to add.

Also, the returned matrix from cut_tree is in such shape, that each column represents groups at certain cut. So i transposed the matrix, but you could also adjust the for loops accordingly.

import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage, to_tree, cut_tree
from matplotlib import pyplot as plt

X = np.array([[5, 3],
              [10, 15],
              [15, 12],
              [24, 10],
              [30, 30],
              [85, 70],
              [71, 80],
              [60, 78],
              [70, 55],
              [80, 91]])
linkage_matrix = linkage(X, "single")
clusters = cut_tree(linkage_matrix, n_clusters=range(1, X.shape[0]))
print(clusters)
# insert column for the case, where every element is its own cluster
clusters = np.insert(clusters, clusters.shape[1], range(clusters.shape[0]), axis=1)
# transpose matrix
clusters = clusters.T
print(clusters)
for row in clusters[::-1]:
    # create empty dictionary
    groups = {}
    for i, g in enumerate(row):
        if g not in groups:
            # add new key to dict and assign empty set
            groups[g] = set([])
        # add to set of certain group
        groups[g].add(i)
    print(list(groups.values()))

Much Better Solution

Instead of two for loops and cut_tree, use operation on sets and information from linkage_matrix. The for loop runs in linear time complexity, but the most time consuming will be print statement

In the case of around 30_000 samples, printing to file would create around 30GB big file.

linkage_matrix = linkage(X, "single")

dct = dict([(i, {i}) for i in range(X.shape[0])])
print(list(dct.values()))
for i, row in enumerate(linkage_matrix, X.shape[0]):
    dct[i] = dct[row[0]].union(dct[row[1]])
    del dct[row[0]]
    del dct[row[1]]
    print(list(dct.values()))