Is there a way to perform sequential k-means clustering using scikit-learn? I can't seem to find a proper way to add new data, without re-fitting all the data.
Thank you
scikit-learn's KMeans class has a predict method that, given some (new) points, determines which of the clusters these points would belong to. Calling this method does not change the cluster centroids.
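As a minimal sketch of that behaviour, the snippet below fits KMeans once and then labels two new points. The data is synthetic and the variable names are made up for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data (an assumption for illustration): two well-separated blobs.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])
new_points = np.array([[0.1, -0.2], [4.8, 5.1]])

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
centers_before = model.cluster_centers_.copy()

# predict only assigns each new point to its nearest existing centroid.
new_labels = model.predict(new_points)

# The centroids are untouched by predict.
centers_unchanged = np.array_equal(centers_before, model.cluster_centers_)
```

Here centers_unchanged should come out True, since predict never refits the model.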
If you do want the centroids to be changed by the addition of new data, i.e. you want to do clustering in an online setting, use the MiniBatchKMeans estimator and its partial_fit method.
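A rough sketch of the online approach, assuming the data arrives in batches. The batches here are synthetic and the names are hypothetical.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

# Synthetic batches (an assumption for illustration): two blobs per batch.
rng = np.random.default_rng(0)

def make_batch(n):
    return np.vstack([rng.normal(0, 0.5, (n, 2)), rng.normal(5, 0.5, (n, 2))])

first_batch = make_batch(50)
second_batch = make_batch(50)

model = MiniBatchKMeans(n_clusters=2, n_init=3, random_state=0)

# The first call initialises the centroids from this batch.
model.partial_fit(first_batch)
centers_before = model.cluster_centers_.copy()

# Later calls update the existing centroids using only the new batch.
model.partial_fit(second_batch)
centers_after = model.cluster_centers_

labels = model.predict(second_batch)
```

Each partial_fit call sees only the batch it is given, so the earlier data never has to be kept around or re-fitted.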
You can pass in initial values for the centroids with the init parameter to sklearn.cluster.k_means. So then you can just do:
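import numpy as np
from sklearn.cluster import k_means
centroids, labels, inertia = k_means(data, k)
new_data = np.append(data, extra_pts, axis=0)  # axis=0 stacks rows; without it np.append flattens the arrays
new_centroids, new_labels, new_inertia = k_means(new_data, k, init=centroids, n_init=1)  # one run from the given centroids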
This assumes you're just adding data points and not changing k.
I think this will sometimes mean you get a suboptimal result, but it should usually be faster. You might want to occasionally redo the fit with, say, 10 random seeds and take the best one.
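One way to do that occasional check, sketched with the KMeans estimator rather than the k_means function: compare the warm-started fit against a fresh fit with 10 random initialisations and keep whichever has the lower inertia. The data and names below are made up for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data (an assumption for illustration): three blobs.
rng = np.random.default_rng(0)
blob_centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])

def sample(n):
    return np.vstack([rng.normal(c, 0.5, (n, 2)) for c in blob_centers])

data = sample(50)
extra_pts = sample(20)
new_data = np.vstack([data, extra_pts])
k = 3

previous = KMeans(n_clusters=k, n_init=10, random_state=0).fit(data)

# Warm start: a single run seeded with the previous centroids.
warm = KMeans(n_clusters=k, init=previous.cluster_centers_, n_init=1).fit(new_data)

# Occasional sanity check: 10 fresh initialisations, best one kept by KMeans.
fresh = KMeans(n_clusters=k, n_init=10, random_state=0).fit(new_data)

# Keep whichever fit has the lower within-cluster sum of squares.
best = min((warm, fresh), key=lambda m: m.inertia_)
```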
It's also relatively easy to write your own function that finds out which centroid is closest to a point that you are considering. Assuming you have some matrix X that is ready for kmeans:
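import numpy as np
from sklearn import cluster
centroids, labels, inertia = cluster.k_means(X, 5)
def pred(arr):
    # index of the centroid nearest to arr (Euclidean distance)
    return np.argmin([np.linalg.norm(arr - b) for b in centroids])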
You can confirm that this works via:
[pred(X[i]) == labels[i] for i in range(len(X))]
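If you need to label many points at once, the same nearest-centroid idea can be vectorised with NumPy broadcasting. This is only a sketch: pred_many is a hypothetical helper, and the centroids and points are made-up values rather than the output of a real fit.

```python
import numpy as np

def pred_many(points, centroids):
    # Pairwise differences, shape (n_points, n_centroids, n_features).
    diffs = points[:, None, :] - centroids[None, :, :]
    # Euclidean distance from every point to every centroid.
    distances = np.linalg.norm(diffs, axis=2)
    # Index of the nearest centroid for each point.
    return np.argmin(distances, axis=1)

# Made-up centroids and points for illustration.
centroids = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
points = np.array([[0.2, -0.1], [4.9, 5.3], [0.4, 4.6], [2.0, 1.0]])

nearest = pred_many(points, centroids)  # array([0, 1, 2, 0])
```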