2
votes

Here is an implementation of the k-means algorithm that I put together from the scikit-learn KMeans documentation and a blog post discussing k-means:

#http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html
#http://fromdatawithlove.thegovans.us/2013/05/clustering-using-scikit-learn.html

from sklearn.cluster import KMeans
import numpy as np
from matplotlib import pyplot

X = np.array([[10, 2 , 9], [1, 4 , 3], [1, 0 , 3],
               [4, 2 , 1], [4, 4 , 7], [4, 0 , 5], [4, 6 , 3],[4, 1 , 7],[5, 2 , 3],[6, 3 , 3],[7, 4 , 13]])
# define k once and fit the model a single time
k = 3
kmeans = KMeans(n_clusters=k, random_state=0).fit(X)

labels = kmeans.labels_
centroids = kmeans.cluster_centers_

for i in range(k):
    # select only data observations with cluster label == i
    ds = X[np.where(labels==i)]
    # plot the data observations
    pyplot.plot(ds[:,0],ds[:,1],'o')
    # plot the centroids
    lines = pyplot.plot(centroids[i,0],centroids[i,1],'kx')
    # make the centroid x's bigger
    pyplot.setp(lines,ms=15.0)
    pyplot.setp(lines,mew=2.0)
pyplot.show()

print(kmeans.cluster_centers_.squeeze())

How can I print or access the data points in each of the k clusters?

For example, if k = 3:
cluster 1 : [10, 2 , 9], [1, 4 , 3], [1, 0 , 3]                  
cluster 2 : [4, 0 , 5], [4, 6 , 3],[4, 1 , 7],[5, 2 , 3],[6, 3 , 3],[7, 4 , 13]
cluster 3 : [4, 2 , 1], [4, 4 , 7]

Reading http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html, I can't find an attribute or method on the kmeans object for this. Is there one?

Update :

kmeans.labels_ returns array([1, 0, 2, 0, 2, 2, 0, 2, 0, 0, 1], dtype=int32)

But how does this show the data points in each of the 3 clusters?

Not a method, no....look closer at the documentation in your link. - user554546
@JackManey closest I found are print(kmeans.labels_), print(kmeans.get_params),print(kmeans.cluster_centers_ ) but none of these attributes print the cluster values. - blue-sky
...what do you mean, exactly by "cluster values"? - user554546
@JackManey I realize now 'values' is ambiguous . By values I mean 'data points' , I've updated question to this effect. - blue-sky
Ah, in that case, kmeans.labels_ gives you the cluster assignments for each corresponding data point (remember that rows of NumPy arrays are in a fixed order!). - user554546

2 Answers

1
votes

If you use the labels_ attribute of your fitted KMeans object, you'll get an array of cluster assignments, one for each training vector. The labels array is in the same order as your training data, so you can zip the two together or call numpy.where() for each unique label.
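As a minimal sketch of the numpy.where() approach: the snippet below uses the question's data and, instead of a fitted model, the label array quoted in the question's update. That substitution is an assumption made so the example runs without scikit-learn; with a real model you would pass kmeans.labels_. The name `clusters` is only illustrative.

```python
import numpy as np

X = np.array([[10, 2, 9], [1, 4, 3], [1, 0, 3],
              [4, 2, 1], [4, 4, 7], [4, 0, 5], [4, 6, 3],
              [4, 1, 7], [5, 2, 3], [6, 3, 3], [7, 4, 13]])

# Stand-in for kmeans.labels_ (values copied from the question's update).
labels = np.array([1, 0, 2, 0, 2, 2, 0, 2, 0, 0, 1])

# Map each unique label to the rows of X that were assigned to it.
clusters = {int(label): X[np.where(labels == label)]
            for label in np.unique(labels)}

for label, points in clusters.items():
    print("cluster", label, ":", points.tolist())
```

Because the labels are in the same row order as X, the boolean mask `labels == label` selects exactly the points of that cluster.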

1
votes

To access the data points after k-means clustering, zip each point with its cluster label and sort the pairs by label.

Added code:

sortedR = sorted(result, key=lambda x: x[1])
sortedR

Complete code :

    from sklearn.cluster import KMeans
    import numpy as np
    from matplotlib import pyplot

    X = np.array([[10, 2 , 9], [1, 4 , 3], [1, 0 , 3],
                   [4, 2 , 1], [4, 4 , 7], [4, 0 , 5], [4, 6 , 3],[4, 1 , 7],[5, 2 , 3],[6, 3 , 3],[7, 4 , 13]])
    # define k once and fit the model a single time
    k = 3
    kmeans = KMeans(n_clusters=k, random_state=0).fit(X)

    labels = kmeans.labels_
    centroids = kmeans.cluster_centers_

    for i in range(k):
        # select only data observations with cluster label == i
        ds = X[np.where(labels==i)]
        # plot the data observations
        pyplot.plot(ds[:,0],ds[:,1],'o')
        # plot the centroids
        lines = pyplot.plot(centroids[i,0],centroids[i,1],'kx')
        # make the centroid x's bigger
        pyplot.setp(lines,ms=15.0)
        pyplot.setp(lines,mew=2.0)
    pyplot.show()

    # pair each data point with its cluster label, then sort the pairs by label
    result = zip(X, kmeans.labels_)

    sortedR = sorted(result, key=lambda x: x[1])
    print(sortedR)
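The sorted pairs can also be collected per cluster, which is closer to the output format asked for in the question. This is a rough sketch using itertools.groupby. As an assumption, the label array quoted in the question's update stands in for kmeans.labels_ so the example runs on its own, and the name `grouped` is illustrative.

```python
from itertools import groupby

import numpy as np

X = np.array([[10, 2, 9], [1, 4, 3], [1, 0, 3],
              [4, 2, 1], [4, 4, 7], [4, 0, 5], [4, 6, 3],
              [4, 1, 7], [5, 2, 3], [6, 3, 3], [7, 4, 13]])

# Stand-in for kmeans.labels_ (values copied from the question's update).
labels = np.array([1, 0, 2, 0, 2, 2, 0, 2, 0, 0, 1])

# Same idea as the answer: pair points with labels and sort by label.
sortedR = sorted(zip(X, labels), key=lambda pair: pair[1])

# groupby only merges adjacent items, so the input must already be sorted
# by the same key.
grouped = {int(label): [point.tolist() for point, _ in pairs]
           for label, pairs in groupby(sortedR, key=lambda pair: pair[1])}

for label, points in grouped.items():
    print("cluster", label + 1, ":", points)
```

Since Python's sort is stable, the points inside each cluster keep the order they had in X.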