I have an issue understanding cluster assignment in k-means clustering. Specifically, I know that each point is assigned to the closest cluster (the one whose center is at the shortest distance), but I wasn't able to reproduce the assignments that kmeans returns. Details are given below.
Let's say I have a data frame df1:
set.seed(16)
df1 = data.frame(matrix(sample(1:50, replace = T), ncol=10, nrow=10000))
head(df1, n=4)
  X1 X2 X3 X4 X5 X6 X7 X8 X9 X10
1 35 35 35 35 35 35 35 35 35  35
2 13 13 13 13 13 13 13 13 13  13
3 23 23 23 23 23 23 23 23 23  23
4 12 12 12 12 12 12 12 12 12  12
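A note on the output above: every row holds the same value in all ten columns. A minimal sketch of why, assuming the cause is recycling: sample(1:50, replace = TRUE) returns only 50 values, and matrix() repeats them to fill the 10000 x 10 cells, so each column is a copy of the first. The names recycled, independent and df_independent are illustrative and not part of the original code, and drawing one value per cell is an assumed alternative.

```r
set.seed(16)

# Same call as in the question: sample() returns only 50 values here,
# and matrix() recycles them to fill all 10000 x 10 cells.
recycled <- matrix(sample(1:50, replace = TRUE), ncol = 10, nrow = 10000)

# Every column is a copy of the first one, so there are at most 50 distinct rows.
all_columns_identical <- all(recycled == recycled[, 1])
distinct_rows <- nrow(unique(recycled))
print(all_columns_identical)
print(distinct_rows)

# Assumed alternative: draw one value per cell so the columns vary independently.
independent <- matrix(sample(1:50, size = 10000 * 10, replace = TRUE),
                      ncol = 10, nrow = 10000)
df_independent <- data.frame(independent)
head(df_independent, n = 4)
```

With at most 50 distinct rows, a six-cluster solution is fitted to what is effectively one-dimensional data, which is worth keeping in mind when reading the distances further down.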
On that data frame I want to perform k-means clustering (with scaling):
for_clst_km = scale(df1, center=F) #standardization with z-scores
kclust = 6 #number of clusters
Clusters <- kmeans(for_clst_km, kclust)
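For reference, here is a small sketch of what scale() computes with center = FALSE, run on toy data rather than df1. According to the documentation of scale(), the columns are then divided by their root mean square, sqrt(sum(x^2) / (n - 1)), and are not centered, which is not the same as a z-score. The variable names are illustrative only.

```r
set.seed(1)
x <- matrix(runif(40, min = 1, max = 50), ncol = 4)  # toy data, not the question's df1

no_center <- scale(x, center = FALSE)

# Root mean square per column: sqrt(sum(x^2) / (n - 1)).
rms <- sqrt(colSums(x^2) / (nrow(x) - 1))
by_hand <- sweep(x, 2, rms, "/")
same_as_rms <- isTRUE(all.equal(no_center, by_hand, check.attributes = FALSE))

# A z-score is (x - mean) / sd, which is what scale(x) does by default.
z_by_hand <- sweep(sweep(x, 2, colMeans(x)), 2, apply(x, 2, sd), "/")
same_as_z <- isTRUE(all.equal(scale(x), z_by_hand, check.attributes = FALSE))

# The two scalings give different numbers.
differs <- !isTRUE(all.equal(no_center, z_by_hand, check.attributes = FALSE))
print(c(same_as_rms, same_as_z, differs))
```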
After the clustering is finished, I can add the cluster assignments to the original data frame:
df1$cluster = Clusters$cluster
For testing purposes, let's pick cluster 3.
library(dplyr)
cluster3 = df1 %>% filter(cluster == 3)
Because I want to scale cluster3, I first need to delete the cluster column and then perform z-standardization:
cluster3$cluster = NULL
cluster3_1 = (cluster3-colMeans(df1))/apply(df1,2,sd)
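Two details of the line above are worth checking. First, df1 already contains the cluster column at this point, so colMeans(df1) and apply(df1, 2, sd) return 11 values for the 10 columns of cluster3. Second, arithmetic between a data frame and a plain vector recycles the vector down the rows, not across the columns. A minimal sketch on a hypothetical two-column data frame, with sweep() as one way to apply a per-column statistic:

```r
df <- data.frame(a = c(1, 2, 3), b = c(10, 20, 30))  # hypothetical toy data
mu <- colMeans(df)                                   # a = 2, b = 20

# The vector is recycled cell by cell in column-major order:
# column a receives (2, 20, 2) and column b receives (20, 2, 20).
wrong <- df - mu

# sweep() with MARGIN = 2 subtracts mu[j] from every entry of column j.
right <- sweep(as.matrix(df), 2, mu)

print(wrong)
print(right)
```

Only the sweep() result has column means of zero. The same pattern with "/" as the function applies a per-column divisor.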
Now that I have the scaled values in cluster3_1, I can calculate the distance from each point to the center of each cluster:
centroids = data.matrix(Clusters$centers)
dist_to_clust1 = apply(cluster3_1, 1, function(x) sqrt(sum((x-centroids[1,])^2)))
dist_to_clust2 = apply(cluster3_1, 1, function(x) sqrt(sum((x-centroids[2,])^2)))
dist_to_clust3 = apply(cluster3_1, 1, function(x) sqrt(sum((x-centroids[3,])^2)))
dist_to_clust4 = apply(cluster3_1, 1, function(x) sqrt(sum((x-centroids[4,])^2)))
dist_to_clust5 = apply(cluster3_1, 1, function(x) sqrt(sum((x-centroids[5,])^2)))
dist_to_clust6 = apply(cluster3_1, 1, function(x) sqrt(sum((x-centroids[6,])^2)))
dist_to_clust = cbind(dist_to_clust1, dist_to_clust2, dist_to_clust3, dist_to_clust4, dist_to_clust5, dist_to_clust6)
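The six apply() calls above can be collapsed into one loop over the centroids, and the nearest centroid can then be compared with the cluster that kmeans reports. This is a sketch on hypothetical toy data, assuming the distances are computed on exactly the matrix that was passed to kmeans; toy, scaled, fit, nearest and matches are illustrative names.

```r
set.seed(42)
# Toy data (hypothetical, not the question's df1): two well-separated groups.
toy <- rbind(matrix(rnorm(200, mean = 5), ncol = 2),
             matrix(rnorm(200, mean = 20), ncol = 2))
scaled <- scale(toy, center = FALSE)
fit <- kmeans(scaled, centers = 2, iter.max = 100)

# One column of Euclidean distances per centroid.
dist_to_clust <- sapply(seq_len(nrow(fit$centers)), function(k) {
  sqrt(rowSums(sweep(scaled, 2, fit$centers[k, ])^2))
})

# Index of the closest centroid for every row, compared with the assignment.
nearest <- apply(dist_to_clust, 1, which.min)
matches <- all(nearest == fit$cluster)
print(matches)
```

If the check passes on the matrix given to kmeans but fails on a rescaled copy, the rescaling step is the likely source of the mismatch.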
Finally, after looking at the distances to each cluster, it is evident that I am doing something wrong. For example, the point in the fifth row is closest to cluster 4 (i.e. that column holds the smallest value), even though it was assigned to cluster 3.
head(dist_to_clust)
     dist_to_clust1 dist_to_clust2 dist_to_clust3 dist_to_clust4 dist_to_clust5 dist_to_clust6
[1,]      11.015929      11.116591      10.946547      11.173597      11.034535      10.968986
[2,]      13.136060      12.848511      12.967084      13.379930      12.840414      12.861085
[3,]      13.681588      13.314994      13.492713      13.942535      13.322293      13.360695
[4,]      10.506083      10.725233      10.467843      10.636465      10.621233      10.529714
[5,]       2.157906       5.392285       3.120574       1.168265       4.855553       4.197457
[6,]      11.015929      11.116591      10.946547      11.173597      11.034535      10.968986
I believe the mistake is in how I scale the data. I am not sure whether I can actually scale the cluster 3 points with the means and standard deviations of the entire data frame.
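One way to test this doubt is to reuse the scaling factors that scale() stored when the full data set was scaled, and apply them to a subset of rows. A minimal sketch on toy data, assuming the same center = FALSE call as above; scale() keeps the per-column divisors in the "scaled:scale" attribute. The row indices and variable names are hypothetical.

```r
set.seed(7)
toy <- matrix(runif(60, min = 1, max = 50), ncol = 3)  # toy data, not df1
scaled <- scale(toy, center = FALSE)

# Per-column divisors that scale() applied to the whole data set.
divisors <- attr(scaled, "scaled:scale")

# Rescale an arbitrary subset of the raw rows with those same divisors.
subset_rows <- c(2, 5, 9)
rescaled <- sweep(toy[subset_rows, , drop = FALSE], 2, divisors, "/")

# The subset now matches the corresponding rows of the clustering input.
consistent <- isTRUE(all.equal(rescaled, scaled[subset_rows, , drop = FALSE],
                               check.attributes = FALSE))
print(consistent)
```

Statistics from the whole data frame are therefore fine to use, as long as they are the same ones that produced the matrix given to kmeans.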
Can you please share your thoughts on what I am doing wrong? Thank you very much!
head(df1). After your test data is fixed, we can dig into your clustering. – G5W
scale does? But it shows you how vulnerable your result is to scaling differences. – Has QUIT--Anony-Mousse