
I have an issue understanding cluster assignment in k-means clustering. I know that a point is assigned to the closest cluster (shortest distance to the cluster center), but I wasn't able to reproduce the results. Details are given below.

Let's say I have a data frame df1:

set.seed(16)
df1 = data.frame(matrix(sample(1:50, replace = T), ncol=10, nrow=10000))
head(df1, n=4)

  X1 X2 X3 X4 X5 X6 X7 X8 X9 X10
1 35 35 35 35 35 35 35 35 35  35
2 13 13 13 13 13 13 13 13 13  13
3 23 23 23 23 23 23 23 23 23  23
4 12 12 12 12 12 12 12 12 12  12

On that data frame I want to perform k-means clustering (with scaling):

for_clst_km = scale(df1, center=F) #standardization with z-scores

kclust = 6 #number of clusters
Clusters <- kmeans(for_clst_km, kclust)

After the clustering is finished I can assign the clusters to the original data frame:

df1$cluster = Clusters$cluster

For testing purposes let's pick cluster 3.

library(dplyr)
cluster3 = df1 %>% filter(cluster == 3)

Because I want to scale cluster3, I first need to delete the cluster column and then perform the z-standardization:

cluster3$cluster = NULL

cluster3_1 = (cluster3-colMeans(df1))/apply(df1,2,sd)

Now that I have the scaled values in cluster3_1, I can calculate the distance to the center of each cluster:

centroids = data.matrix(Clusters$centers)

dist_to_clust1 = apply(cluster3_1, 1, function(x) sqrt(sum((x-centroids[1,])^2)))
dist_to_clust2 = apply(cluster3_1, 1, function(x) sqrt(sum((x-centroids[2,])^2)))
dist_to_clust3 = apply(cluster3_1, 1, function(x) sqrt(sum((x-centroids[3,])^2)))
dist_to_clust4 = apply(cluster3_1, 1, function(x) sqrt(sum((x-centroids[4,])^2)))
dist_to_clust5 = apply(cluster3_1, 1, function(x) sqrt(sum((x-centroids[5,])^2)))
dist_to_clust6 = apply(cluster3_1, 1, function(x) sqrt(sum((x-centroids[6,])^2)))

dist_to_clust = cbind(dist_to_clust1, dist_to_clust2, dist_to_clust3, dist_to_clust4, dist_to_clust5, dist_to_clust6)
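As a side note, the six near-identical calls above can be collapsed into a single sapply over the centroid rows. A minimal self-contained sketch of the same pattern (toy data, not your df1):

```r
# Toy example: 5 points and 3 centroids in 2-D
set.seed(1)
pts       <- matrix(rnorm(10), ncol = 2)
centroids <- matrix(rnorm(6),  ncol = 2)

# One column of Euclidean distances per centroid,
# same layout as the cbind() of dist_to_clust1..6 above
dists <- sapply(1:nrow(centroids), function(k)
  apply(pts, 1, function(x) sqrt(sum((x - centroids[k, ])^2))))

dim(dists)  # 5 rows (points) x 3 columns (centroids)
```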

Finally, after observing the distances to each cluster, it is evident that I am doing something wrong. For example, looking at the fifth row I see that the point is closest to cluster 4 (i.e. that is the smallest value).

head(dist_to_clust)

     dist_to_clust1 dist_to_clust2 dist_to_clust3 dist_to_clust4 dist_to_clust5 dist_to_clust6
[1,]      11.015929      11.116591      10.946547      11.173597      11.034535      10.968986
[2,]      13.136060      12.848511      12.967084      13.379930      12.840414      12.861085
[3,]      13.681588      13.314994      13.492713      13.942535      13.322293      13.360695
[4,]      10.506083      10.725233      10.467843      10.636465      10.621233      10.529714
[5,]       2.157906       5.392285       3.120574       1.168265       4.855553       4.197457
[6,]      11.015929      11.116591      10.946547      11.173597      11.034535      10.968986

I believe there is a mistake in my scaling methodology. I am not sure whether I can actually scale the cluster-3 points with the means and standard deviations of the entire data frame.

Can you please share your thoughts, what am I doing wrong? Thank you very much!

One problem is your data generation. You only generate 100 distinct points and then put them into a 100 x 10 matrix, so you get 10 identical columns. Try head(df1). After your test data is fixed, we can dig into your clustering. – G5W

@G5W The same issue appears with a data frame of over 600k rows. Therefore, I believe something is wrong with my methodology. – Makaroni

Perhaps, but you have not given us a reasonable test case to resolve the problem. – G5W

@G5W See my edited question. Same thing with 10000 rows... The 100 distinct points are there because I have such a problem in reality. – Makaroni

Because the latter isn't exactly what scale does? But it shows you how vulnerable your result is to scaling differences. – Has QUIT--Anony-Mousse

2 Answers


From my answer at Cross Validated:


It's because df - colMeans(df) doesn't do what you think.

Let's try the code:

a=matrix(1:9,nrow=3)

     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9

colMeans(a)

[1] 2 5 8

a-colMeans(a)

     [,1] [,2] [,3]
[1,]   -1    2    5
[2,]   -3    0    3
[3,]   -5   -2    1

apply(a,2,function(x) x-mean(x))

     [,1] [,2] [,3]
[1,]   -1   -1   -1
[2,]    0    0    0
[3,]    1    1    1

You'll find that a - colMeans(a) does something different from apply(a, 2, function(x) x - mean(x)), which is what you want for centering.
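The mismatch comes from R's recycling rules: in a - colMeans(a) the length-3 vector is recycled down the columns (column-major order), not across the rows. A minimal illustration of what actually happens:

```r
a <- matrix(1:9, nrow = 3)
m <- colMeans(a)  # c(2, 5, 8)

# Recycling fills in column-major order: m is subtracted elementwise
# down every column, so row i loses m[i] -- not column j losing m[j]
manual <- a - matrix(rep(m, ncol(a)), nrow = 3)
stopifnot(identical(a - m, manual))
```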

You could write an apply to do the full autoscaling for you:

apply(a,2,function(x) (x-mean(x))/sd(x))

     [,1] [,2] [,3]
[1,]   -1   -1   -1
[2,]    0    0    0
[3,]    1    1    1

scale(a)

     [,1] [,2] [,3]
[1,]   -1   -1   -1
[2,]    0    0    0
[3,]    1    1    1
attr(,"scaled:center")
[1] 2 5 8
attr(,"scaled:scale")
[1] 1 1 1

But there's no point in doing that apply, since scale will do it for you. :)
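If you ever do need manual control over the steps, base R's sweep() is the idiomatic tool for column-wise operations; a sketch that reproduces scale(a) (up to the attached attributes):

```r
a <- matrix(1:9, nrow = 3)

# Subtract column means, then divide by column standard deviations
centered   <- sweep(a, 2, colMeans(a), "-")
autoscaled <- sweep(centered, 2, apply(a, 2, sd), "/")

max(abs(autoscaled - scale(a)))  # 0: values agree with scale()
```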


Moreover, to try out the clustering:

set.seed(16)
nc=10
nr=10000
# Make sure you draw enough samples: There was extreme periodicity in your sampling
df1 = matrix(sample(1:50, size=nr*nc,replace = T), ncol=nc, nrow=nr)
head(df1, n=4)

for_clst_km = scale(df1) #standardization with z-scores
nclust = 4 #number of clusters
Clusters <- kmeans(for_clst_km, nclust)

# For extracting scaled values: They are already available in for_clst_km
cluster3_sc=for_clst_km[Clusters$cluster==3,]

# Simplify code by putting distance in function
distFun=function(mat,centre) apply(mat, 1, function(x) sqrt(sum((x-centre)^2)))

centroids=Clusters$centers
dists=matrix(nrow=nrow(cluster3_sc),ncol=nclust) # Allocate matrix
for(d in 1:nclust) dists[,d]=distFun(cluster3_sc,centroids[d,])  # Calculate observation distances to centroid d=1..nclust

whichMins=apply(dists,1,which.min) # Calculate the closest centroid per observation
table(whichMins) # Tabularize

> table(whichMins)
whichMins
   3 
2532 

HTH HAND,
Carl


Your hand-written scaling code is broken. Check the standard deviation of the resulting data; it isn't 1.

Why don't you just subset the already scaled data by cluster label:

cluster3 = for_clst_km[Clusters$cluster == 3, ]

(for_clst_km is a matrix without a cluster column, so dplyr's filter can't be used on it directly.)