2
votes

I am confused because I see different ways to implement the elbow method for choosing the number of clusters in k-means, and they produce slightly different results.

One method is described in "Sklearn kmeans equivalent of elbow method" and uses the inertia_ attribute of the fitted KMeans model. The other method is described at https://pythonprogramminglanguage.com/kmeans-elbow-method/ and uses the following command:

distortions.append(sum(np.min(cdist(X, kmeanModel.cluster_centers_, 'euclidean'), axis=1)) / X.shape[0])

What does the inertia_ attribute of KMeans compute, and are both implementations correct?


4 Answers

2
votes

There is no "correct" for something that is not at all well-defined.

The elbow method is an extremely crude heuristic for which I am not aware of any formal definition, nor a reference.

Both methods will supposedly most often yield the same k...

But by the concept of k-means, the "correct" way to use it is with squared errors, not with Euclidean distances. k-means minimizes the sum of squared errors; it does not minimize the sum of Euclidean distances (try to prove that it does! You can't, because there are counterexamples).
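A small counterexample can be sketched with a single cluster in one dimension. The data values and the helper names sum_sq and sum_abs are made up for illustration; the point is only that the mean (the centroid k-means uses) minimizes the squared error, while a different point (the median) gives a smaller sum of plain distances.

```python
import numpy as np

# Toy 1-D data treated as a single cluster (k = 1); values are illustrative only.
x = np.array([0.0, 0.0, 10.0])

def sum_sq(c):
    """Sum of squared distances from every point to the candidate center c."""
    return float(np.sum((x - c) ** 2))

def sum_abs(c):
    """Sum of plain (unsquared) distances from every point to c."""
    return float(np.sum(np.abs(x - c)))

mean_c = float(x.mean())        # the centroid that k-means would use
median_c = float(np.median(x))  # an alternative center for comparison

# The mean wins on squared error, the median wins on plain distance.
print("squared error:  mean =", sum_sq(mean_c), " median =", sum_sq(median_c))
print("plain distance: mean =", sum_abs(mean_c), " median =", sum_abs(median_c))
```

So a centroid that is optimal for the squared-error objective is not, in general, optimal for the sum of Euclidean distances.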

0
votes

Reading the documentation for KMeans, you can see that distortion and inertia are both built from the distance of each point to its closest center. Note, however, that the documentation defines inertia_ as the sum of squared distances.

0
votes

Both the scikit-learn User Guide on KMeans and Andrew Ng's CS229 lecture notes on k-means indicate that k-means minimizes the sum of squared distances between cluster points and their cluster centroids, which is the quantity the elbow method plots against k. The sklearn documentation calls this "inertia" and points out that it is subject to the drawback of inflated Euclidean distances in high-dimensional spaces. Ng calls this minimization of the "distortion function". One can, however, find examples where distortion is defined as "the average of the squared distances from the cluster centers of the respective clusters" (emphasis mine), in contrast to "inertia" being the sum of the squared distances. While this is confusing, and confirms that these terms are not consistently defined, it suggests to me that both work.
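The relationship between the two definitions can be checked with a short sketch. The synthetic data, the variable names, and the choice of n_clusters=3 are assumptions made for illustration; the check itself only relies on inertia_ being the sum of squared distances to the closest center.

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans

# Synthetic 2-D data, for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Euclidean distance from each point to its closest cluster center.
d = np.min(cdist(X, km.cluster_centers_, "euclidean"), axis=1)

sum_sq = float(np.sum(d ** 2))   # "inertia" style: sum of squared distances
mean_sq = sum_sq / X.shape[0]    # "distortion" style: average of squared distances

print("inertia_ from sklearn:", km.inertia_)
print("sum of squared dists :", sum_sq)
print("mean of squared dists:", mean_sq)
```

Since the average is just the sum divided by the fixed number of points, the two curves have the same shape over k and should suggest the same elbow.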

0
votes

I find many examples around that use this formula and claim that it calculates the distortion. However, in my understanding it seems wrong:

distortions.append(sum(np.min(cdist(X, kmeanModel.cluster_centers_, 'euclidean'), axis=1)) / X.shape[0])

If you want to use squared distances (here averaged over the number of points rather than summed), it should be:

distortions.append(sum(np.min(cdist(X, kmeanModel.cluster_centers_, 'euclidean'), axis=1)**2) / X.shape[0])
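For comparison, here is a minimal sketch of an elbow loop that records both variants next to sklearn's inertia_. The synthetic data, the range of k, and the list names are assumptions for illustration, not taken from either linked tutorial.

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans

# Synthetic data: three loose groups in 2-D, for illustration only.
rng = np.random.default_rng(42)
offsets = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
X = np.vstack([rng.normal(loc=c, scale=1.0, size=(100, 2)) for c in offsets])

ks = range(1, 7)
mean_dist = []     # unsquared variant from the question
mean_sq_dist = []  # squared variant from this answer
inertias = []      # sklearn's own value (sum of squared distances)

for k in ks:
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # Distance from each point to its closest center.
    d = np.min(cdist(X, model.cluster_centers_, "euclidean"), axis=1)
    mean_dist.append(float(np.sum(d)) / X.shape[0])
    mean_sq_dist.append(float(np.sum(d ** 2)) / X.shape[0])
    inertias.append(float(model.inertia_))

for k, a, b, c in zip(ks, mean_dist, mean_sq_dist, inertias):
    print(f"k={k}  mean dist={a:.3f}  mean sq dist={b:.3f}  inertia={c:.1f}")
```

Plotting any of the three lists against k gives an elbow curve; only the squared versions correspond to the objective that k-means actually minimizes.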