0 votes

I am trying to cluster 2-dimensional user data using KMeans in scikit-learn (Python). I used the elbow method (the point where increasing the number of clusters no longer brings a significant drop in the sum of squared errors) to identify the appropriate number of clusters as 50.
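For reference, here is a minimal sketch of what I did for the elbow method (variable names are illustrative; X holds the 2-D user data as a NumPy array, and the k range is just an example):

from sklearn.cluster import KMeans

sse = []
for k in range(2, 60):
    km = KMeans(n_clusters=k, random_state=0).fit(X)
    sse.append(km.inertia_)  # within-cluster sum of squared errors for this k
# Plot k against sse and pick the k where the curve flattens out (the "elbow").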

After applying KMeans, I wish to understand the similarity of the data points within each cluster. Since I have 50 clusters, is there a way to get a number (something like the variance within each cluster) that would help me understand how close the data points within each cluster are? A number like 0.8 would mean the records have high variance within the cluster, while 0.2 would mean they are closely "related".

So to summarize: is there any way to get a single number that identifies how "good" each cluster in KMeans is? One can argue that goodness is relative, but let's say I am mainly interested in the within-cluster variance as a measure of how good a particular cluster is.
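For concreteness, the kind of per-cluster number I have in mind is something like the within-cluster variance below (a rough sketch only; X is the 2-D data and the KMeans fit is illustrative):

import numpy as np
from sklearn.cluster import KMeans

km = KMeans(n_clusters=50, random_state=0).fit(X)
labels, centers = km.labels_, km.cluster_centers_
for c in range(km.n_clusters):
    members = X[labels == c]
    # mean squared distance of the cluster's points to their centroid
    print(c, np.mean(np.sum((members - centers[c]) ** 2, axis=1)))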

There are two notions of similarity relevant to clustering: inter-cluster and intra-cluster. Inter-cluster (between clusters) separation should be high; intra-cluster (within a cluster) distance should be small. I suggest looking at en.wikipedia.org/wiki/Silhouette_(clustering) for further reading and understanding. - shahaf
Thanks Shahaf. I've seen the silhouette coefficient being used to identify the k value for KMeans, but after finding the "ideal" k, can you guide me on how the silhouette can be used on each and every cluster? Python code would be really helpful. - Sundaresh Prasanna
I'm facing the same issue; I'm using the silhouette score to find the best K clusters. As you mentioned, the silhouette method can also be used to calculate a score for each sample, like so: scikit-learn.org/stable/modules/generated/… - shahaf
Within-cluster sum of squares and variance are not limited to the range 0-1. It is trivial to compute these values yourself, but they won't be very useful. - Has QUIT--Anony-Mousse

1 Answer

0 votes

Code example using the silhouette score, taken from https://plot.ly/scikit-learn/plot-kmeans-silhouette-analysis/:

from __future__ import print_function

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score

# Generating the sample data from make_blobs
# This particular setting has one distinct cluster and 3 clusters placed close
# together.
X, y = make_blobs(n_samples=500,
                  n_features=2,
                  centers=4,
                  cluster_std=1,
                  center_box=(-10.0, 10.0),
                  shuffle=True,
                  random_state=1)  # For reproducibility

range_n_clusters = [2, 3, 4, 5, 6]

for n_clusters in range_n_clusters:
  # Initialize the clusterer with n_clusters value and a random generator
  # seed of 10 for reproducibility.
  clusterer = KMeans(n_clusters=n_clusters, random_state=10)
  cluster_labels = clusterer.fit_predict(X)
  print(cluster_labels)
  # The silhouette_score gives the average value for all the samples.
  # This gives a perspective into the density and separation of the formed
  # clusters
  silhouette_avg = silhouette_score(X, cluster_labels)
  print("For n_clusters =", n_clusters,
        "The average silhouette_score is :", silhouette_avg)

  # Compute the silhouette scores for each sample
  sample_silhouette_values = silhouette_samples(X, cluster_labels)
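  # To get one number per cluster (what the question asks for), average the
  # per-sample silhouette values within each cluster: values near 1 indicate
  # a tight, well-separated cluster; values near 0 or below indicate overlap
  # or poorly assigned points.
  for i in range(n_clusters):
    ith_cluster_values = sample_silhouette_values[cluster_labels == i]
    print("Cluster", i, "mean silhouette:", ith_cluster_values.mean())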