
I'm supposed to implement k-means clustering on some data. The example I looked at from http://glowingpython.blogspot.com/2012/04/k-means-clustering-with-scipy.html shows its test data in 2 columns... however, the data I'm given is 68 subjects with 78 features (so a 68x78 matrix). How am I supposed to create an appropriate input for this?
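From what I can tell from the scipy docs, each row is supposed to be one observation and each column one feature, so here is a stripped-down sketch of what I think the input should look like (random stand-in data, since I can't post mine):

    import numpy as np
    from scipy.cluster.vq import kmeans, vq, whiten

    # stand-in for my real data: 68 subjects (rows) x 78 features (columns)
    data = np.random.rand(68, 78)

    # normalize each feature to unit variance, as the scipy docs recommend
    whitened = whiten(data)

    # cluster the 68 subjects into 2 groups
    centroids, _ = kmeans(whitened, 2)
    idx, _ = vq(whitened, centroids)
    print(idx.shape)  # (68,) -- one cluster label per subject

Is that the right way to set it up?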

I've basically just tried inputting the matrix anyway, but it doesn't seem to do what I want... and I don't know why it would. I'm pretty confused as to what to do.

    import numpy as np
    from scipy.cluster.vq import kmeans, vq
    from pylab import plot, show

    data = np.rot90(data)  # rotate the matrix (I'm not sure this is the right orientation)
    centroids, _ = kmeans(data, 2)
    # assign each sample to a cluster
    idx, _ = vq(data, centroids)

    # some plotting using numpy's logical indexing
    plot(data[idx == 0, 0], data[idx == 0, 1], 'ob',
         data[idx == 1, 0], data[idx == 1, 1], 'or')
    plot(centroids[:, 0], centroids[:, 1], 'sg', markersize=8)
    show()

I honestly don't know what other code to show you; the data format is as I described above. Otherwise, it's the same as the tutorial I linked.

Show the code that you've actually tried. What specific error messages or unexpected behaviour are you seeing, other than "doesn't seem to do what I want"? – Amit Kumar Gupta

i.imgur.com/ILmPQS3.png I don't understand why some points that look closer to one cluster are labeled as the other... and honestly I was kind of hoping they'd be more separated (but I suppose it's possible they aren't). – Programmermatt

Looks like you're using scipy, not scikit-learn, so you should probably change the tag. I'm fairly new to k-means myself, but with that said, 78 features seems like a lot. Are all 78 numeric, non-categorical variables? – Bob Haffner

Yeah, they are. That being said, is my data even organized correctly? The example makes it seem like x,y coordinates, but I have no idea how my data could even be graphed. I simply have 68 vectors of length 78. I feel like I'm really missing something, and it's getting frustrating. – Programmermatt

You're only plotting 2 of the 78 dimensions. The blue points that look closer to the red centroid in this 2-dimensional projection are actually closer to the blue centroid in the full 78-dimensional space. – Amit Kumar Gupta

1 Answer


Your visualization only uses the first two dimensions.

That is why these points appear to be "incorrectly" assigned: they are closer to the other centroid in dimensions that you are not plotting.
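If you want to convince yourself of that, you can check the full 78-dimensional distances directly (a sketch, reusing data, centroids and idx from your code, and assuming samples are in rows):

    from scipy.spatial.distance import cdist

    # distance from every sample to every centroid in the full 78-d space
    dists = cdist(data, centroids)              # shape (68, 2)
    print((dists.argmin(axis=1) == idx).all())  # True: vq picked the nearest centroid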

Have a look at the next two dimensions:

    plot(data[idx == 0, 2], data[idx == 0, 3], 'ob',
         data[idx == 1, 2], data[idx == 1, 3], 'or')
    plot(centroids[:, 2], centroids[:, 3], 'sg', markersize=8)
    show()

... and so on for the remaining pairs of your 78 dimensions (or use a loop, as sketched below).
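Instead of copying that by hand for every pair, a small loop can scan them all (a sketch, assuming data, centroids and idx from the question, with the 78 features in columns):

    from pylab import plot, figure, title, show

    # one scatter plot per consecutive pair of dimensions: (0,1), (2,3), ...
    for d in range(0, 78, 2):
        figure()
        plot(data[idx == 0, d], data[idx == 0, d + 1], 'ob',
             data[idx == 1, d], data[idx == 1, d + 1], 'or')
        plot(centroids[:, d], centroids[:, d + 1], 'sg', markersize=8)
        title('dimensions %d and %d' % (d, d + 1))
    show()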

With this many features, (squared) Euclidean distance becomes almost meaningless because of the curse of dimensionality: the distances between most pairs of points become nearly equal, and k-means results tend to be no better than random convex partitions of the data.
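You can see this effect on purely random data: as the number of dimensions grows, the ratio between the largest and smallest pairwise distance shrinks towards 1, so "nearest" stops meaning much (an illustration with synthetic data, not yours):

    import numpy as np
    from scipy.spatial.distance import pdist

    rng = np.random.RandomState(0)
    for dim in (2, 10, 78, 1000):
        d = pdist(rng.rand(68, dim))      # all pairwise Euclidean distances
        print(dim, d.max() / d.min())     # ratio shrinks towards 1 as dim grows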

To get a more representative view, consider using multidimensional scaling (MDS) to project the data into 2D for visualization. It should be reasonably fast with only 68 subjects.
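For example, a sketch using scikit-learn's MDS (assuming you have scikit-learn installed; reuses data and idx from the question):

    from sklearn.manifold import MDS
    from pylab import plot, show

    # project the 68x78 matrix to 2 dimensions, trying to preserve pairwise distances
    xy = MDS(n_components=2, random_state=0).fit_transform(data)
    plot(xy[idx == 0, 0], xy[idx == 0, 1], 'ob',
         xy[idx == 1, 0], xy[idx == 1, 1], 'or')
    show()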

Please include visualizations in your questions. We don't have your data.