
I am currently looking into hierarchy in the topics of documents. As a first step, I compute a vector representation of my documents, and then use hierarchical clustering to determine whether there are topics nested within topics. I only want to consider (nested) clusters that contain at least, say, 2% of the original data. I am using R for this.

Now I am struggling to extract the cluster hierarchy efficiently from the clustering results. Clustering is done with the "fastcluster" package, which produces results similar to those of the original "hclust" function.

My final output should look something like this, consisting of two tables:

Cluster Assignments:

docID , ClusterLabel
1, A
2, A
3, B
4, B
5, B
3, C
4, D
5, C 
...

Cluster Hierarchy:

Parent, Child
B, C
B, D
...

As you can see, observations 3, 4 and 5 occur multiple times in the cluster assignment table, because each of them belongs both to a parent cluster and to one of its subclusters. The parent-child relations are listed in the hierarchy table.

My current approach is to use the cutree.dendrogram function from the "dendextend" package to find the cluster assignments for a grid of values of k, and then deduce the cluster hierarchy and assignments from the output. However, this approach is very naïve and becomes terribly slow for large numbers of observations and clusters.

Suggestions on how to tackle this problem efficiently, preferably using readily available packages, would be greatly appreciated.

EDIT: Consider the following example, corresponding to the sample output data:

data <- matrix(data = c(1,2,3,4,5,1,3,5,9,10), nrow = 5, ncol = 2)
plot(data)

hc <- hclust(dist(data))
plot(hc)

If we cut the tree at height 6, we obtain 2 clusters, named A and B in the output. If we instead cut the tree at height 4, we obtain 3 clusters, named A, C and D in the output. The observation with docID 3, for example, is then in either cluster B or cluster C (depending on the height at which we cut the tree), which corresponds to its two entries in the sample cluster assignment table. Cluster B is split into the two clusters C and D, which can be seen in the cluster hierarchy table.

My goal is to obtain the complete list of cluster assignments and hierarchies while moving down the dendrogram (preferably stopping once a (sub)cluster contains fewer than a certain number of observations).
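To make the goal concrete, here is a minimal sketch of one possible way to build both tables directly from the `merge` component of an `hclust` object, without calling `cutree` repeatedly. The names `extract_hierarchy` and `min_size` are hypothetical and not part of any package; clusters are labelled by their merge step number rather than by letters, and the root cluster containing all observations is included. It assumes the object follows the standard `hclust` layout (negative entries in `merge` are single observations, positive entries refer to earlier merge steps), which objects returned by fastcluster are expected to follow as well.

```r
# Walk the merge matrix of an hclust object and collect every nested cluster
# that holds at least `min_size` observations.
# `extract_hierarchy` and `min_size` are hypothetical names for this sketch.
extract_hierarchy <- function(hc, min_size = 2) {
  n_steps <- nrow(hc$merge)
  members <- vector("list", n_steps)

  # Merge rows are ordered bottom-up, so children are filled in before parents.
  for (i in seq_len(n_steps)) {
    obs <- integer(0)
    for (child in hc$merge[i, ]) {
      # Negative entry: a single observation; positive entry: an earlier step.
      obs <- c(obs, if (child < 0) -child else members[[child]])
    }
    members[[i]] <- sort(obs)
  }

  sizes <- lengths(members)
  keep  <- which(sizes >= min_size)

  # One row per (observation, cluster) pair, so observations can repeat.
  assignments <- data.frame(
    docID   = unlist(members[keep]),
    cluster = rep(keep, times = sizes[keep])
  )

  # A kept merge step is the parent of each child step that is also kept.
  parent <- rep(keep, each = 2)
  child  <- as.vector(t(hc$merge[keep, , drop = FALSE]))
  ok     <- child > 0 & child %in% keep
  hierarchy <- data.frame(parent = parent[ok], child = child[ok])

  list(assignments = assignments, hierarchy = hierarchy)
}

data <- matrix(data = c(1,2,3,4,5,1,3,5,9,10), nrow = 5, ncol = 2)
hc <- hclust(dist(data))
res <- extract_hierarchy(hc, min_size = 2)
res$assignments
res$hierarchy
```

For a 2% threshold, `min_size` could be set to something like `ceiling(0.02 * nrow(data))`. Note that storing the full member list of every merge step can use a lot of memory for very unbalanced trees, so this is an illustration of the idea rather than a tuned implementation.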

Until now I have failed to come up with a reasonably efficient method to do this; hopefully someone can provide me with an idea.

Hi @Bartdp1, could you please update your question with a self-contained reproducible example? - Tal Galili
@TalGalili I updated the question, thanks for taking a look. - BDP1

1 Answer


The k parameter of the cutree function accepts a vector of values, in which case the output has one row per observation and one column per value of k:

> data <- matrix(data = c(1,2,3,4,5,1,3,5,9,10), nrow = 5, ncol = 2)
> hc <- hclust(dist(data))
> cutree(hc, k = 1:5)
     1 2 3 4 5
[1,] 1 1 1 1 1
[2,] 1 1 1 2 2
[3,] 1 1 2 3 3
[4,] 1 2 3 4 4
[5,] 1 2 3 4 5
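As a rough sketch of how parent/child pairs might be read off such a matrix: going from k to k + 1 clusters splits exactly one cluster, so comparing adjacent columns shows which cluster was split and what its two children are. The variable names `ks`, `ct` and `splits` are made up for this illustration. Cluster labels are only meaningful within a column, so a cluster is identified by the pair (k, label) here, and no minimum cluster size is applied.

```r
data <- matrix(data = c(1,2,3,4,5,1,3,5,9,10), nrow = 5, ncol = 2)
hc <- hclust(dist(data))
ks <- 1:5
ct <- cutree(hc, k = ks)

# For each pair of adjacent cuts, find the one cluster that was split.
splits <- do.call(rbind, lapply(seq_len(ncol(ct) - 1), function(j) {
  # Distinct (label at k, label at k + 1) combinations.
  pairs <- unique(ct[, c(j, j + 1)])
  counts <- table(pairs[, 1])
  # The split cluster is the only label at k that maps to two labels at k + 1.
  split_label <- as.integer(names(counts)[counts > 1])
  kids <- unname(pairs[pairs[, 1] == split_label, 2])
  data.frame(k = ks[j], parent = split_label,
             child_k = ks[j + 1], child = kids)
}))
splits
```

This still evaluates every value of k, so it does not by itself address the speed concern for large trees; it only shows how the hierarchy could be deduced from the `cutree` output.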

Does this answer your question?