I am doing some cluster analysis with R. I am using the hclust() function and, once the clustering is done, I would like to get the representative of each cluster.
I define a cluster representative as the instance that is closest to the centroid of its cluster.
So the steps are:
- Finding the centroid of each cluster
- Finding the cluster representatives
I have already asked a similar question for k-means: https://stats.stackexchange.com/questions/251987/cluster-analysis-with-k-means-how-to-get-the-cluster-representatives
The problem, in this case, is that hclust() doesn't return the centroids!
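To illustrate the difference, here is a minimal sketch on made-up data. The names x, hc and km are only illustrative and are not part of my actual script; the point is just to compare which components the two fitted objects carry.

```r
# Small made-up matrix, used only to inspect the returned objects
set.seed(1)
x <- matrix(rnorm(20), ncol = 2)

hc <- hclust(dist(x), method = "single")
km <- kmeans(x, centers = 2)

names(hc)                  # components stored in the hclust object
"centers" %in% names(hc)   # FALSE: no centroids are stored
"centers" %in% names(km)   # TRUE: kmeans keeps one centroid per cluster
```

So with hclust() the centroids have to be computed by hand from the cluster labels that cutree() returns.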
For example, supposing that d is the distance matrix computed from my data, what I have done so far is:
hclust.fit1 <- hclust(d, method="single")
groups1 <- cutree(hclust.fit1, k=3) # cut tree into 3 clusters
## getting centroids ##
mycentroid <- colMeans(CV)
clust.centroid = function(i, dat, groups1) {
  ind = (groups1 == i)
  colMeans(dat[ind,])
}
centroids <- sapply(unique(groups1), clust.centroid, data, groups1)
Next, I tried to get the cluster representatives with this code (taken from the other question I asked, for k-means):
index <- c()
for (i in 1:3){
  rowsum <- rowSums(abs(CV[which(centroids==i),1:3] - centroids[i,]))
  index[i] <- as.numeric(names(which.min(rowsum)))
}
And it fails with:
"Error in e2[[j]] : subscript out of bounds"
I would be grateful if any of you could give me a little help. Thanks.
-- (not) Working example of the code --
example_data.txt
A,B,C
10.761719,5.452188,7.575762
10.830457,5.158822,7.661588
10.75391,5.500170,7.740330
10.686719,5.286823,7.748297
10.864527,4.883244,7.628730
10.701415,5.345650,7.576218
10.820583,5.151544,7.707404
10.877528,4.786888,7.858234
10.712337,4.744053,7.796390
As for the code:
# Install R packages
#install.packages("fpc")
#install.packages("cluster")
#install.packages("rgl")
library(fpc)
library(cluster)
library(rgl)
CV <- read.csv("example_data.txt")
str(CV)
data <- scale(CV)
d <- dist(data,method = "euclidean")
hclust.fit1 <- hclust(d, method="single")
groups1 <- cutree(hclust.fit1, k=3) # cut tree into 3 clusters
mycentroid <- colMeans(CV)
clust.centroid = function(i, dat, groups1) {
  ind = (groups1 == i)
  colMeans(dat[ind,])
}
centroids <- sapply(unique(groups1), clust.centroid, CV, groups1)
index <- c()
for (i in 1:3){
  rowsum <- rowSums(abs(CV[which(centroids==i),1:3] - centroids[i,]))
  index[i] <- as.numeric(names(which.min(rowsum)))
}
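For reference, a minimal sketch of the two steps listed at the top (one centroid per cluster, then the member closest to it). It rests on a few assumptions that differ from the code above: the centroids are computed on the scaled data rather than on CV, closeness is measured with squared Euclidean distance rather than a sum of absolute differences, and the names fit, groups and representatives are only illustrative. The example data are inlined so that the block runs on its own.

```r
# Example data from this post, inlined so the sketch is self-contained
CV <- read.csv(text = "A,B,C
10.761719,5.452188,7.575762
10.830457,5.158822,7.661588
10.75391,5.500170,7.740330
10.686719,5.286823,7.748297
10.864527,4.883244,7.628730
10.701415,5.345650,7.576218
10.820583,5.151544,7.707404
10.877528,4.786888,7.858234
10.712337,4.744053,7.796390")

data <- scale(CV)
fit <- hclust(dist(data, method = "euclidean"), method = "single")
groups <- cutree(fit, k = 3)      # cluster labels 1..3
ids <- sort(unique(groups))

# Step 1: one centroid per cluster (rows = clusters, columns = variables).
# Assumption: centroids are taken on the scaled data used for clustering.
centroids <- t(sapply(ids, function(g) {
  colMeans(data[groups == g, , drop = FALSE])
}))

# Step 2: for each cluster, the row index of the member closest to its
# centroid (squared Euclidean distance, an assumed choice of metric).
representatives <- sapply(ids, function(g) {
  members <- which(groups == g)
  diffs <- sweep(data[members, , drop = FALSE], 2, centroids[g, ])
  members[which.min(rowSums(diffs^2))]
})

CV[representatives, ]             # representatives in the original units
```

The drop = FALSE arguments keep single-member clusters as one-row matrices, so colMeans() and rowSums() still work on them; single linkage tends to produce such clusters.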