4
votes

I have to perform a cluster analysis on a large amount of data. Since I have a lot of missing values, I computed a correlation matrix using pairwise complete observations.

corloads = cor(df1[,2:185], use = "pairwise.complete.obs")

Now I don't know how to go on. I have read a lot of articles and examples, but nothing really works for me. How can I find out how many clusters are appropriate for my data?

I already tried this:

dissimilarity = 1 - corloads
distance = as.dist(dissimilarity) 

plot(hclust(distance), main="Dissimilarity = 1 - Correlation", xlab="") 

I got a plot, but it's very messy, and I don't know how to read it or how to go on. It looks like this:

[Image: dendrogram produced by the plot call above]

Any idea how to improve it? And what can I actually get out of it?

I also wanted to create a scree plot. I read that it shows a curve from which you can see how many clusters are appropriate.

I also ran a cluster analysis for 2 to 20 clusters, but the output is so long that I have no idea how to handle it or which parts are important to look at.

2
You can choose the number of clusters arbitrarily. How to determine the number of clusters is controversial, but some analyses can help you out. Take a look at the kgs function from the maptree package. – patL
I looked it up and read that it should help to find a good number of clusters, but I am totally new to this subject. What do I have to pass as cluster and diss? Is cluster my corloads and diss my distance? kgs(cluster, diss, alpha=1, maxclust=NULL) – Essi

2 Answers

3
votes

Several methods are available to determine the "optimal number of clusters", although the topic is controversial.

The kgs function from the maptree package is one helpful way to get an optimal number of clusters.

Following your code one would do:

library(maptree)  # provides kgs()

clus <- hclust(distance)
op_k <- kgs(clus, distance, maxclust = 20)
plot(names(op_k), op_k, xlab = "# clusters", ylab = "penalty")

So, according to the kgs function, the optimal number of clusters is the one with the smallest penalty in op_k, as you can see in the plot. You can get the minimum penalty itself with

min(op_k)

Note that I set the maximum number of clusters allowed to 20. You can also set this argument to NULL, which is its default.

Check this page for more methods.

Hope it helps you.

Edit

To find which number of clusters is optimal, you can do

op_k[which(op_k == min(op_k))]

Plus

Also see this post for an excellent graphical answer from @Ben.

Edit

op_k[which(op_k == min(op_k))]

still returns the penalty (named by the number of clusters). To get the optimal number of clusters itself, use

as.integer(names(op_k[which(op_k == min(op_k))]))
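
Once a number of clusters has been chosen, the tree can be cut at that level to see which variables end up together. The following is only a minimal sketch: it assumes the built-in mtcars data as a stand-in for df1[,2:185] and a hypothetical choice of k = 3, neither of which comes from the question.

```r
# Stand-in data: mtcars replaces df1[, 2:185] purely for illustration
corloads <- cor(mtcars, use = "pairwise.complete.obs")
distance <- as.dist(1 - corloads)
clus <- hclust(distance)

k <- 3  # hypothetical value; use the number found with kgs instead
groups <- cutree(clus, k = k)  # named integer vector: variable -> cluster id

table(groups)                  # cluster sizes
split(names(groups), groups)   # variable names in each cluster
```

The groups vector can then be used to summarise or plot each cluster of variables separately.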

1
vote

I'm happy to learn about the kgs function. Another option is using the find_k function from the dendextend package (it uses the average silhouette width). But given the kgs function, I might just add it as another option to the package. Also note the dendextend::color_branches function, to color your dendrogram with the number of clusters you end up choosing (you can see more about this here: https://cran.r-project.org/web/packages/dendextend/vignettes/introduction.html#setting-a-dendrograms-branches).
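
The average silhouette width idea can also be tried by hand. The sketch below is not how find_k is implemented; it simply cuts the hclust tree at each candidate k and scores the cut with silhouette() from the cluster package. The mtcars data and the candidate range of 2 to 8 clusters are assumptions made for illustration only.

```r
library(cluster)  # recommended package, normally installed with R

# Stand-in data: mtcars replaces df1[, 2:185] purely for illustration
corloads <- cor(mtcars, use = "pairwise.complete.obs")
distance <- as.dist(1 - corloads)
clus <- hclust(distance)

ks <- 2:8  # arbitrary candidate range for this small example
avg_sil <- sapply(ks, function(k) {
  sil <- silhouette(cutree(clus, k = k), distance)
  mean(sil[, "sil_width"])  # average silhouette width for this cut
})
names(avg_sil) <- ks

# Candidate k with the largest average silhouette width
best_k <- as.integer(names(which.max(avg_sil)))

plot(ks, avg_sil, type = "b",
     xlab = "# clusters", ylab = "average silhouette width")
```

Unlike the kgs penalty, where the minimum is of interest, a larger average silhouette width indicates a better separated clustering, so here the maximum is taken.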