I've written a function that does k-means clustering on normal distributions. The function can be used on both one-dimensional and two-dimensional normal distributions. Plotting the 1-D k-means clustering is easy and can be done using the following:

    plot(data[idx==0,0],data[idx==0,1],'ob',
         data[idx==1,0],data[idx==1,1],'or',
         data[idx==2,0],data[idx==2,1],'og',
         data[idx==3,0],data[idx==3,1],'oy',
         data[idx==4,0],data[idx==4,1],'oc')

    plot(centroids[:,0],centroids[:,1],'sg',markersize=8)
    show()

which will give a plot like this:

kmeans on 1-d normal distributions

A plot of a 2-D normal distribution looks like this:

a 2-d normal distribution

A 2-D normal distribution has mean = [a b] and var = [[p q],[r s]]. The centroids obtained when clustering 2-D distributions have the same shape as the means and variances of the points (obviously). The problem I'm facing is plotting this data. How can it be visualized using Python and matplotlib? The points in the 1-D case would be replaced by ellipses, and each centroid would also be an ellipse. The clustering should look something like:

example of the result

where black ellipses are the 2d distributions and red ones are detected 2d centroids.

The plot function that I'm using to plot a single 2-D distribution is:

def plot2DND(mean, variance):
    mean1 = mean.flatten()
    cov1 = variance

    nobs = 2500
    rvs1 = np.random.multivariate_normal(mean1, cov1, size=nobs)

    plt.plot(rvs1[:, 0], rvs1[:, 1], '.')
    plt.axis('equal')
    plt.show()

The following figure gives a better visualization of the requirement (from: http://www.lix.polytechnique.fr/~nielsen/pdf/2008-C-ClusteringNormal-ETVC.pdf):

better visualization of the required plot

Is it possible to achieve something like this using Python and matplotlib (or other libraries)? Or is a better visualization possible for this type of data?

And the output from the kmeans you have is a list of 2d means and 2d variances? - askewchan
Yes. I have the list of 2d means and 2d variance values. - Abhishek Thakur

1 Answer

The first thing you need to do is change your plotting function so that it plots a single contour instead of all the points. You can use Ellipse to do this, and use the eigenvalues and eigenvectors of your variance matrix to find the axis lengths and the angle (I hope I did this right; it requires the variance matrix to be symmetric, which a covariance matrix should be).

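import numpy as np
import matplotlib.pyplot as plt
from matplotlib.patches import Ellipse

def plot_ellipse(mean, var, ec='k', alpha=1):
    # eigh assumes var is symmetric; eigenvalues come back in ascending order
    evals, evecs = np.linalg.eigh(var)
    # eigenvectors are columns; the last one points along the major axis
    ang = np.degrees(np.arctan2(evecs[1, -1], evecs[0, -1]))
    width, height = 2 * np.sqrt(evals[::-1])  # full axes of the 1-sigma contour
    ell = Ellipse(mean, width, height, angle=ang, fc='None', ec=ec, alpha=alpha)
    ax = plt.gca()
    ax.add_patch(ell)  # unlike add_artist, this updates the data limits
    ax.autoscale_view()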
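
For reference, here is a minimal, self-contained sketch of just the eigen-decomposition step, separated from the plotting. It assumes the covariance matrix is symmetric. The helper name `cov_to_ellipse` and the `n_std` parameter are my own illustrative choices, not anything from matplotlib. The full width and height are taken as `2 * n_std * sqrt(eigenvalue)`, and the angle is the direction of the eigenvector belonging to the largest eigenvalue.

```python
import numpy as np

def cov_to_ellipse(cov, n_std=1.0):
    """Return (width, height, angle_deg) for the n_std-sigma ellipse of a 2x2 covariance.

    Hypothetical helper for illustration; assumes cov is symmetric.
    """
    cov = np.asarray(cov, dtype=float)
    # eigh is intended for symmetric matrices and returns ascending eigenvalues
    evals, evecs = np.linalg.eigh(cov)
    order = np.argsort(evals)[::-1]           # largest variance first
    evals, evecs = evals[order], evecs[:, order]
    # Eigenvectors are the columns; the first is now the major-axis direction
    major = evecs[:, 0]
    # An axis has no sign, so fold the angle into [0, 180)
    angle = np.degrees(np.arctan2(major[1], major[0])) % 180.0
    # Full axis lengths; clip guards against tiny negative round-off
    width, height = 2.0 * n_std * np.sqrt(np.clip(evals, 0.0, None))
    return width, height, angle
```

The three returned values can be passed straight to `Ellipse(mean, width, height, angle=angle)`.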

So, let's say you've done whatever you need with your data and you end up with something like mean_centroids and variance_centroids which would have shapes (k, 2) and (k, 2, 2).

colors = ['r', 'g', 'b'] # length of this should be `k`

for i, (m, v) in enumerate(zip(mean_centroids, variance_centroids)):
    plot_ellipse(m, v, ec=colors[i])

You probably have a data array of lots of means and variances, so you can just loop through it all, coloring by the labels, which you'd get from centroids, labels = kmeans2(data, k):

for i, m, v in zip(labels, means, variances):
    plot_ellipse(m, v, ec=colors[i], alpha=.5)
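
To show how the two loops fit together, here is a small end-to-end sketch. Everything in it is made up for illustration: the means, the variances, and the labels are hard-coded rather than coming from `kmeans2`, and the "centroids" are just per-label averages standing in for whatever your clustering returns. `draw_gaussian` is a hypothetical helper that draws the 1-sigma ellipse on a given axes.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs without a display
import matplotlib.pyplot as plt
from matplotlib.patches import Ellipse

def draw_gaussian(ax, mean, cov, **style):
    """Add the 1-sigma ellipse of a 2-D Gaussian to ax (assumes cov is symmetric)."""
    evals, evecs = np.linalg.eigh(cov)        # ascending eigenvalues
    major = evecs[:, -1]                      # eigenvector of the largest eigenvalue
    angle = np.degrees(np.arctan2(major[1], major[0]))
    width, height = 2 * np.sqrt(evals[-1]), 2 * np.sqrt(evals[0])
    ell = Ellipse(mean, width, height, angle=angle, fc="none", **style)
    ax.add_patch(ell)                         # add_patch updates the data limits
    return ell

# Made-up input distributions and hand-assigned labels (not real k-means output)
means = np.array([[0.0, 0.0], [1.0, 0.5], [6.0, 5.0], [5.0, 6.0]])
variances = np.array([
    [[1.0, 0.3], [0.3, 0.5]],
    [[0.8, -0.2], [-0.2, 0.6]],
    [[0.5, 0.0], [0.0, 1.5]],
    [[1.2, 0.4], [0.4, 0.9]],
])
labels = np.array([0, 0, 1, 1])
colors = ["r", "b"]  # one color per cluster

# Stand-in centroids: plain per-label averages (an assumption for this sketch)
mean_centroids = np.array([means[labels == j].mean(axis=0) for j in range(2)])
variance_centroids = np.array([variances[labels == j].mean(axis=0) for j in range(2)])

fig, ax = plt.subplots()
# Input distributions: thin, semi-transparent, colored by label
for i, m, v in zip(labels, means, variances):
    draw_gaussian(ax, m, v, ec=colors[i], alpha=0.5)
# Centroids: same colors, thicker outline
for j, (m, v) in enumerate(zip(mean_centroids, variance_centroids)):
    draw_gaussian(ax, m, v, ec=colors[j], lw=2)
ax.autoscale_view()      # rescale the view to the patches just added
ax.set_aspect("equal")   # keep the ellipses undistorted
fig.canvas.draw()        # render once; use plt.show() or fig.savefig(...) to view
```

With an interactive backend you would drop the `matplotlib.use("Agg")` line and call `plt.show()` at the end.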

By the way, you can replace your first scatter plot example with:

colors = ['b', 'r', 'g', 'y', 'c']
# idx holds the cluster label of each point; data must have exactly two columns
plt.scatter(*data.T, c=np.choose(idx, colors))

plt.plot(centroids[:,0],centroids[:,1],'sg',markersize=8)
plt.show()
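
If `np.choose` feels opaque, an equivalent way to build the per-point colors is to index a NumPy array of color codes with the label array. This is just a sketch with made-up labels; `idx` stands in for the labels from your clustering.

```python
import numpy as np

colors = np.array(['b', 'r', 'g', 'y', 'c'])
idx = np.array([0, 2, 1, 4, 0])   # made-up cluster labels, one per point

# Fancy indexing picks the color code for each point's label
point_colors = colors[idx]
```

`point_colors` can then be passed as `c=point_colors` to `plt.scatter`.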