1
votes

I have a large matrix of 500K observations to cluster using hierarchical clustering. Due to the large size, i do not have the computing power to calculate the distance matrix.

To overcome this problem I chose to aggregate my matrix to merge those observations which were identical to reduce my matrix to about 10K observations. I have the frequency for each of the rows in this aggregated matrix. I now need to incorporate this frequency as a weight in my hierarchical clustering.

The data is a mixture of numerical and categorical variables for the 500K observations so i have used the daisy package to calculate the gower dissimilarity for my aggregated dataset. I want to use hclust in the stats package for the aggregated dataset however i want to take into account the frequency of each observation. From the help information for hclust the arguments are as follows:

    hclust(d, method = "complete", members = NULL)

The information for the members argument is:, NULL or a vector with length size of d. See the ‘Details’ section. When you look at the details section you get: If members != NULL, then d is taken to be a dissimilarity matrix between clusters instead of dissimilarities between singletons and members gives the number of observations per cluster. This way the hierarchical cluster algorithm can be ‘started in the middle of the dendrogram’, e.g., in order to reconstruct the part of the tree above a cut (see examples). Dissimilarities between clusters can be efficiently computed (i.e., without hclust itself) only for a limited number of distance/linkage combinations, the simplest one being squared Euclidean distance and centroid linkage. In this case the dissimilarities between the clusters are the squared Euclidean distances between cluster means.

From the above description, i am unsure if i can assign my frequency weights to the members arguments as it is not clear if this is the purpose of this argument. I would like to use it like this:

hclust(d, method = "complete", members = df$freq)

Where df$freq is the frequency of each row in the aggregated matrix. So if a row is duplicated 10 times this value would be 10.

If anyone can help me that would be great,

Thanks

1

1 Answers

0
votes

Yes, this should work fine for most linkages, in particular single, group average and complete linkage. For ward etc. you need to correctly take the weights into account yourself.

But even that part is not hard. Just make sure to use the cluster sizes, because you need to pass the distance of two clusters, not two points. So the matrix should contain the distance of n1 points at location x and n2 points at location y. For min/max/mean this n disappears or cancels out. For ward, you should get a SSQ like formula.