1 vote

Is there a way to perform sequential k-means clustering using scikit-learn? I can't seem to find a proper way to add new data without re-fitting all the data.

Thank you


3 Answers

7 votes

scikit-learn's KMeans class has a predict method that, given some (new) points, determines which of the clusters these points would belong to. Calling this method does not change the cluster centroids.
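For example, a minimal sketch with made-up data (the arrays here are purely illustrative):

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.RandomState(0)
km = KMeans(n_clusters=3, random_state=0).fit(rng.randn(100, 2))

new_points = rng.randn(5, 2)
km.predict(new_points)  # one cluster index per point; the centroids stay fixed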

If you do want the centroids to be changed by the addition of new data, i.e. you want to do clustering in an online setting, use the MiniBatchKMeans estimator and its partial_fit method.
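A minimal sketch of that online workflow, again with made-up data:

import numpy as np
from sklearn.cluster import MiniBatchKMeans

rng = np.random.RandomState(0)
mbk = MiniBatchKMeans(n_clusters=3, random_state=0)
mbk.partial_fit(rng.randn(100, 2))  # fit on an initial batch
mbk.partial_fit(rng.randn(20, 2))   # update the centroids with the new batch
mbk.cluster_centers_                # now reflects both batches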

3 votes

You can pass initial values for the centroids via the init parameter of sklearn.cluster.k_means. So then you can just do:

import numpy as np
from sklearn.cluster import k_means

centroids, labels, inertia = k_means(data, k)
new_data = np.append(data, extra_pts, axis=0)  # axis=0 stacks rows; omitting it would flatten the array
new_centroids, new_labels, new_inertia = k_means(new_data, k, init=centroids, n_init=1)

assuming you're just adding data points and not changing k.

I think this will sometimes mean you get a suboptimal result, but it should usually be faster. You might want to occasionally redo the fit with, say, 10 random seeds and take the best one.
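Continuing the snippet above, the n_init parameter of k_means runs that restart-and-keep-the-best loop for you:

# periodic full re-fit: 10 random restarts, keep the run with the lowest inertia
best_centroids, best_labels, best_inertia = k_means(new_data, k, n_init=10)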

1 vote

It's also relatively easy to write your own function that finds which centroid is closest to a new point. Assuming you have some matrix X that is ready for k-means:

import numpy as np
from sklearn import cluster

centroids, labels, inertia = cluster.k_means(X, 5)

def pred(arr):
    # index of the centroid closest to arr (Euclidean distance)
    return np.argmin([np.linalg.norm(arr - b) for b in centroids])

You can confirm that it matches the labels from the fit:

all(pred(X[i]) == labels[i] for i in range(len(X)))  # expect True
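And for a point that was not part of the fit (new_point below is a hypothetical observation with the same number of features as X), pred assigns a cluster without re-running k_means:

new_point = np.zeros(X.shape[1])  # hypothetical new observation
pred(new_point)                   # index of its nearest centroid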