4
votes

I am trying to cluster some products based on users' behavior. What I end up with are clusters with very different numbers of observations.

I have checked the k-means clustering parameters and could not find one that controls the minimum (or maximum) number of observations per cluster.

For example, here is how the observations are distributed across the clusters:

cluster_id   num_observations
0            6
1            4
2            1
3            3
4            29
5            5

How can I deal with this issue? Is there another clustering algorithm that can take care of it?

How are you calculating the clusters? By putting a limit on the number of observations allowed in each group, your results will be biased and could be incorrect, especially if you plan on using the model on real data. - Edeki Okoh
This might be a good sign that you should select fewer clusters for your KMeans! - MaximeKan
I'm not sure why you'd want to do this, and if you do, it's not k-means clustering, but here's a thought: do k-means clustering, then, for each cluster below the minimum size, find the point nearest to the cluster center that is NOT already in the cluster, and move it there. Repeat. I don't know, however, how to interpret what that would really mean. - ViennaMike
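
The repair pass described in the last comment can be sketched roughly as follows. This is only an illustration under stated assumptions, not an established algorithm: `enforce_min_size` is a hypothetical helper name, the cluster centers are held fixed instead of being recomputed, and points are only taken from clusters that currently hold more than `size_min` members (an added rule, so that fixing one cluster does not break another). The result is no longer a k-means solution.

```python
import math
from collections import Counter


def enforce_min_size(points, labels, centers, size_min):
    """Greedy repair pass: grow undersized clusters by pulling in nearby points.

    Hypothetical helper, not part of any library. Centers are kept fixed.
    """
    labels = list(labels)
    k = len(centers)
    if len(points) < k * size_min:
        raise ValueError("not enough points to give every cluster size_min members")
    counts = Counter(labels)
    while True:
        small = [c for c in range(k) if counts[c] < size_min]
        if not small:
            return labels
        c = small[0]
        # Candidates: points outside cluster c whose own cluster can spare a member
        # (assumption added here so that a donor never drops below size_min).
        candidates = [i for i, lab in enumerate(labels)
                      if lab != c and counts[lab] > size_min]
        # Move the candidate closest to the center of the undersized cluster.
        nearest = min(candidates, key=lambda i: math.dist(points[i], centers[c]))
        counts[labels[nearest]] -= 1
        labels[nearest] = c
        counts[c] += 1


# Toy input: cluster 1 starts with a single member.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5)]
labels = [0, 0, 0, 0, 1]
centers = [(0.5, 0.5), (5, 5)]
print(enforce_min_size(points, labels, centers, size_min=2))  # → [0, 0, 0, 1, 1]
```

Each move reduces the total shortfall by one, so the loop terminates whenever there are at least `size_min` points per cluster overall.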

2 Answers

1
votes

For those still looking for an answer: I found a good module that deals with this kind of problem.

Install it with pip install size-constrained-clustering or pip install git+https://github.com/jingw2/size_constrained_clustering.git, then use MinMaxKMeansMinCostFlow, where you can set size_min and size_max:

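import numpy as np
from size_constrained_clustering import minmax

n_samples = 2000
n_clusters = 3
X = np.random.rand(n_samples, 2)  # random 2-D points
# every cluster must end up with between 400 and 800 points
model = minmax.MinMaxKMeansMinCostFlow(n_clusters, size_min=400, size_max=800)
model.fit(X)
centers = model.cluster_centers_
labels = model.labels_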
0
votes

This can be solved with the k-means-constrained library, which is installable with pip (pip install k-means-constrained).

Example:

>>> from k_means_constrained import KMeansConstrained
>>> import numpy as np
>>> X = np.array([[1, 2], [1, 4], [1, 0],
...                [4, 2], [4, 4], [4, 0]])
>>> clf = KMeansConstrained(
...     n_clusters=2,
...     size_min=2,
...     size_max=5,
...     random_state=0
... )
>>> clf.fit_predict(X)
array([0, 0, 0, 1, 1, 1], dtype=int32)
>>> clf.cluster_centers_
array([[ 1.,  2.],
       [ 4.,  2.]])
>>> clf.labels_
array([0, 0, 0, 1, 1, 1], dtype=int32)
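
To confirm that a clustering respects the size limits, one option is simply to count the labels. A minimal sketch using only the standard library, with the labels hard-coded from the output above; `check_sizes` is a hypothetical helper name, not part of k-means-constrained.

```python
from collections import Counter


def check_sizes(labels, size_min, size_max):
    """Return per-cluster counts and whether every count lies in [size_min, size_max]."""
    counts = Counter(int(label) for label in labels)
    ok = all(size_min <= n <= size_max for n in counts.values())
    return dict(counts), ok


labels = [0, 0, 0, 1, 1, 1]  # copied from the fit_predict output above
counts, ok = check_sizes(labels, size_min=2, size_max=5)
print(counts, ok)  # → {0: 3, 1: 3} True
```

Note that a cluster with no members never appears in the labels, so this check cannot detect empty clusters on its own.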