3
votes

Example:

load kmeansdata %provides X variable
Y=bsxfun(@minus,X,mean(X,2))'/sqrt(size(X,2)-1); % subtract each row's mean, transpose, and scale
[~,~,PC] = svd(Y); % third output holds the right singular vectors
plot(PC(:,1),PC(:,2),'m.','markersize',15)

Plot the first two columns and you will get what looks like 3 clusters. I want to identify these clusters using kmeans and plot them in different colours as proof. I tried:

[idx,cntrd] = kmeans(PC(:,1:2),3,'Distance','sqEuclidean');%,'Distance','correlation');

cluster=3;
Col = {'.b','.r','.g','.y','.m','.c','.k'}; % Cell array of colours.
figure;
hold on
for clus=1:cluster
  plot(PC(idx==clus,1),PC(idx==clus,2),Col{clus},'MarkerSize',12)  
end
plot(cntrd(:,1),cntrd(:,2),'kx','MarkerSize',15,'LineWidth',3) %plotting the centroids of the clusters

The cluster centroids are off, and the colours aren't what I expected either. Can anyone help?

EDIT: Somewhat answered:

I copied this code from the mathworks site and replaced my kmeans line:

opts = statset('Display','final');
[idx,C] = kmeans(PC(:,1:2),3,'Distance','cityblock',...
    'Replicates',5,'Options',opts);

It seems to work, but I don't quite understand what opts does. Replicates, I assume, just repeats kmeans 5 times and picks some kind of average for the centroids. I've also restarted MATLAB in case there was some sort of glitch.

EDIT: ignore above:

I thought the problem was resolved, so then I tried looking into finding appropriate k values. I entered k=1, ran everything, then k=2, then k=3 and I noticed I got the same mistake again

1
First suggestion (just a side note really) is that you can easily plot the groups as different colours using gscatter. And secondly, have you tried using the 'Replicates',5 option but sticking with the default Euclidean distance rather than using cityblock? Also try leaving off the opts part, maybe you don't need it... - Dan
look at mathworks.com/help/stats/statset.html for what opts is doing. It seems like the Display property only affects the console output of your function, i.e. what feedback it gives you. btw I think you are right re replicate: mathworks.com/help/stats/kmeans.html#bueftl4-1 - Dan
@Dan I ran the code that was shown in my first "EDIT: Somewhat answered:" section (which included the cityblock parameter), and it no longer works as expected. I ran the code again without the distance/cityblock pair so that it uses the default, and it still does not work as expected. In both cases, the centroids are wrong (two of them are in the middle cluster, the third one looks correct), and the colour scheme for both is wrong (the top halves of the left and middle clusters are red, the bottoms are blue, although the right cluster seems correctly coloured green). - CaptainObv
Did you try increasing the replicate parameter number larger than 5? - Dan
@Dan I changed replicates to 10 (again no distance name/value pair), and it is worse. Before, at least the last (right side) cluster had a correct centroid and was coloured entirely green; now all 3 clusters are wrong. Would pictures help? - CaptainObv

1 Answer

0
votes

kmeans can be sensitive to the initial centroid locations. The trouble seems to be the algorithm used for selecting the starting points. For example, you can get the expected answer by supplying the starting centroids yourself:

[idx,cntrd] = kmeans(PC(:,1:2),3, 'start', [-0.05 0; 0 0; 0.05  0]);
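To see the sensitivity to starting points directly, a rough sketch along these lines runs kmeans from several random seeds and prints the total within-cluster distance of each run. The seeds 1 to 5 and the names P and total are arbitrary choices for illustration, not from the original post; the first three lines only rebuild PC as in the question.

```matlab
% Rebuild PC as in the question (kmeansdata ships with the Statistics Toolbox)
load kmeansdata
Y = bsxfun(@minus,X,mean(X,2))'/sqrt(size(X,2)-1);
[~,~,PC] = svd(Y);

P = PC(:,1:2);            % the two columns being clustered
total = zeros(1,5);       % total within-cluster distance for each seed
for s = 1:5
    rng(s);               % different seed, so different random starting centroids
    [~,~,sumd] = kmeans(P,3);
    total(s) = sum(sumd); % sumd holds one within-cluster sum per cluster
end
disp(total)
```

If the totals differ between seeds, the runs ended in different local minima. 'Replicates' automates this loop: kmeans keeps the replicate with the lowest total sum of distances; it does not average the centroids.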

Looks can also be deceiving. In this case the dispersion of the data is not equal in the x and y dimensions. Thus, for some pairs of points, the Euclidean distance between points in different visual clusters is smaller than the distance between points in the same cluster.
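One way to act on the unequal dispersion is to rescale both columns to unit standard deviation before clustering. This is only a sketch under that assumption; I have not checked that it recovers the three visual groups on this data. The names P, Z and cntrdZ, the seed, and the choice of 10 replicates are mine, not from the question.

```matlab
% Rebuild PC as in the question (kmeansdata ships with the Statistics Toolbox)
load kmeansdata
Y = bsxfun(@minus,X,mean(X,2))'/sqrt(size(X,2)-1);
[~,~,PC] = svd(Y);

P = PC(:,1:2);
Z = zscore(P);            % each column now has zero mean and unit standard deviation
rng(1);                   % fixed seed so the run is repeatable (arbitrary choice)
[idx,cntrdZ] = kmeans(Z,3,'Replicates',10);

% Map the centroids back to the original units for plotting
cntrd = bsxfun(@plus, bsxfun(@times,cntrdZ,std(P)), mean(P));

figure
gscatter(P(:,1),P(:,2),idx)   % one colour per cluster label
hold on
plot(cntrd(:,1),cntrd(:,2),'kx','MarkerSize',15,'LineWidth',3)
```

Clustering happens on Z, but the plot uses the original coordinates so it can be compared with the figure in the question.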

You might consider using a Gaussian mixture model for this data.
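A minimal sketch of that idea, assuming a MATLAB release that has fitgmdist in the Statistics Toolbox. The seed, the 10 replicates, and the names P and gm are illustrative choices; I have not verified the resulting clusters on this data.

```matlab
% Rebuild PC as in the question (kmeansdata ships with the Statistics Toolbox)
load kmeansdata
Y = bsxfun(@minus,X,mean(X,2))'/sqrt(size(X,2)-1);
[~,~,PC] = svd(Y);

P = PC(:,1:2);
rng(1);                               % fixed seed for repeatability (arbitrary choice)
gm = fitgmdist(P,3,'Replicates',10);  % by default each component gets its own full covariance
idx = cluster(gm,P);                  % assign each point to its most probable component

figure
gscatter(P(:,1),P(:,2),idx)           % one colour per component
hold on
plot(gm.mu(:,1),gm.mu(:,2),'kx','MarkerSize',15,'LineWidth',3) % component means
```

Because each component has its own covariance matrix, elongated clusters with different spread in x and y can be modelled, which plain Euclidean kmeans cannot do.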