
I want to carry out hierarchical clustering in MATLAB and plot the clusters on a scatter plot. I first used the evalclusters function to investigate what a 'good' number of clusters would be under different criteria, e.g. Silhouette and CalinskiHarabasz. Here is the code I used for the evaluation (x is my data, with 200 observations and 10 variables):

E = evalclusters(x,'linkage','CalinskiHarabasz','KList',1:10)
% store the optimal number of clusters
optk = E.OptimalK;
% save the outputs to a structure
clust_struc(1).Optimalk = optk;
clust_struc(1).method = {'CalinskiHarabasz'};

I then used code similar to what I have found online:

gscatter(x(:,1),x(:,2),E.OptimalY,'rbgckmr','xod*s.p')
% OptimalY is a 200-element vector of cluster indices

and this is what I get:

[image: scatter plot of the clusters]

My question may be silly, but I don't understand why I am using only the first two columns of the data to produce the scatter plot. I realise that the clusters themselves are incorporated through OptimalY, but should I not be using all of the data in x?

My question may also be silly, but this scatter plot has 2 dimensions. Why do you think more data is needed, and what would you do with it? - EBH
Well, the original data is 200x10, so I was wondering why only the first two variables are included. - new2matlab

1 Answer


Each row of x is an observation described by size(x,2) variables (dimensions). All of these dimensions are used when clustering the rows of x.

However, a plot can show at most 2 or 3 dimensions, so each observation has to be represented by a few of its key properties. I'm not sure that x(:,1) and x(:,2) are the best choice, but you have to pick two variables for a 2-D plot.
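To make the distinction concrete, here is a minimal sketch of clustering on all columns while plotting only two of them. It assumes the Statistics and Machine Learning Toolbox is available. The data is a random stand-in (the original x is not shown), and the column pair in cols is an arbitrary, hypothetical choice.

```matlab
% Sketch only: random stand-in data, since the original x is not shown.
rng(1);                                  % reproducible stand-in data
x = [randn(100,10); randn(100,10) + 3];  % 200 observations, 10 variables

% Clustering uses all 10 columns of x.
E = evalclusters(x, 'linkage', 'CalinskiHarabasz', 'KList', 2:10);

% Plotting uses only the two columns picked here (hypothetical choice).
cols = [3 7];
figure;
gscatter(x(:,cols(1)), x(:,cols(2)), E.OptimalY);
xlabel(sprintf('Variable %d', cols(1)));
ylabel(sprintf('Variable %d', cols(2)));

% gplotmatrix shows every pair of variables at once, coloured by cluster.
figure;
gplotmatrix(x, [], E.OptimalY);
```

Changing cols changes only the view of the data. The cluster assignments in E.OptimalY stay the same because they were computed from all ten variables.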

Usually you would have some property of interest that you want to plot. Have a look at the example in the MATLAB documentation: the fisheriris data has 4 variables, the length and width measurements of the sepals and petals of three species of iris flowers. It is up to you to decide which ones to plot against each other (in the example they chose petal length and petal width).

Here is a comparison between using the petal measurements and the sepal measurements as the axes for plotting the grouping:

[image: clustering examples, petal measurements vs. sepal measurements as plot axes]
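A comparison along these lines could be produced roughly as follows. This is a sketch rather than the exact code behind the figure: it assumes the Statistics and Machine Learning Toolbox, and it reuses the linkage-based evalclusters call from the question (the KList range of 1:6 is an assumption).

```matlab
load fisheriris                      % meas: 150x4, species: 150x1 cell
% Columns of meas: sepal length, sepal width, petal length, petal width.

% Cluster on all four measurements (the KList range is an assumption).
E = evalclusters(meas, 'linkage', 'CalinskiHarabasz', 'KList', 1:6);
idx = E.OptimalY;                    % cluster index for each flower

figure;

% Same cluster labels, viewed on the petal measurements.
subplot(1,2,1);
gscatter(meas(:,3), meas(:,4), idx);
xlabel('Petal length'); ylabel('Petal width');
title('Petal measurements');

% Same cluster labels, viewed on the sepal measurements.
subplot(1,2,2);
gscatter(meas(:,1), meas(:,2), idx);
xlabel('Sepal length'); ylabel('Sepal width');
title('Sepal measurements');
```

Both panels use the same idx vector, so any difference in how well separated the groups look comes only from the choice of axes.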