1
votes

In k-means clustering, how should the process be initialized?

Should I choose the k farthest points, or k random points, as the initial clusters and then assign the remaining points to those clusters?

Or

should I pick a single point and then check every other point against it (Euclidean distance), adding the point to that cluster if the distance is < THRESHOLD and forming a new cluster otherwise?

2 Answers

1
votes

To seed the K-Means algorithm, it's standard to choose K random observations from your data set. Since K-Means is subject to local optima (i.e., depending on the initialization, it doesn't always find the best solution), it's also standard to run it several times with different initializations and keep the result with the lowest error.
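As a rough illustration of the seeding-plus-restarts idea, here is a minimal NumPy sketch. The function names `kmeans_once` and `kmeans_restarts`, the defaults `max_iter=100` and `n_init=10`, and the choice of within-cluster sum of squares as the "error" are my own assumptions for the example, not part of any particular library.

```python
import numpy as np

def kmeans_once(X, k, rng, max_iter=100):
    """One run of Lloyd-style k-means seeded with k random observations."""
    # Seed: k distinct rows of X chosen at random.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Assignment step: squared distance from every point to every center.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Update step: move each center to the mean of its points
        # (an empty cluster keeps its previous center).
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    # Error of this run: within-cluster sum of squares for the final centers.
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)
    wcss = d2[np.arange(len(X)), labels].sum()
    return centers, labels, wcss

def kmeans_restarts(X, k, n_init=10, seed=0):
    """Run k-means n_init times and keep the run with the lowest error."""
    rng = np.random.default_rng(seed)
    runs = [kmeans_once(X, k, rng) for _ in range(n_init)]
    return min(runs, key=lambda run: run[2])
```

Calling `centers, labels, wcss = kmeans_restarts(X, 3)` on an `(n, d)` float array returns the best of the ten runs. Individual runs can still land in different local optima, which is the reason for keeping the minimum.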

0
votes

The original MacQueen k-means used the first k objects as the initial configuration. Forgy/Lloyd seem to use k random objects. Both work well enough, but cleverer heuristics (see k-means++) may require fewer iterations.
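For comparison, k-means++ seeding can be sketched in a few lines: the first center is a uniformly random observation, and each later center is sampled with probability proportional to its squared distance from the nearest center chosen so far. The function name `kmeans_pp_seeds` is hypothetical, and the sketch assumes the data contains at least k distinct points (otherwise the probabilities cannot be normalized).

```python
import numpy as np

def kmeans_pp_seeds(X, k, seed=0):
    """Pick k initial centers from the rows of X using k-means++ sampling."""
    rng = np.random.default_rng(seed)
    n = len(X)
    # First center: one observation chosen uniformly at random.
    first = rng.integers(n)
    centers = [X[first]]
    # Squared distance from every point to its nearest chosen center.
    d2 = ((X - X[first]) ** 2).sum(axis=1)
    for _ in range(1, k):
        # Next center: sampled with probability proportional to d2, so
        # far-away points are favored and already-chosen points have weight 0.
        idx = rng.choice(n, p=d2 / d2.sum())
        centers.append(X[idx])
        # Update the nearest-center distances with the new center.
        d2 = np.minimum(d2, ((X - X[idx]) ** 2).sum(axis=1))
    return np.array(centers)
```

The returned array can replace the random seeding step of any Lloyd-style loop; the remaining iterations are unchanged.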

Note that k-means is not distance-based. It minimizes the within-cluster sum of squares (WCSS), which coincides with the sum of squared Euclidean distances from each point to its cluster mean, so assigning each point to the nearest mean by Euclidean distance gives the same result. In the end, though, thinking in terms of Euclidean distances may lead to incorrect conclusions; it is better to think of k-means as minimizing variance.
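To make the variance view concrete, the objective can be written out as follows. The notation is my own: $C_j$ is the set of points in cluster $j$, $\mu_j$ its mean, and $\operatorname{Var}(C_j)$ the sum of the per-coordinate population variances within that cluster.

```latex
\mathrm{WCSS}
  = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2
  = \sum_{j=1}^{k} \lvert C_j \rvert \, \operatorname{Var}(C_j),
\qquad
\mu_j = \frac{1}{\lvert C_j \rvert} \sum_{x_i \in C_j} x_i .
```

Each squared norm is a squared Euclidean distance to the cluster mean, but the quantity being minimized is the size-weighted within-cluster variance, not a sum of unsquared distances.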