5
votes

I am performing mean shift clustering on a dataset. The estimate_bandwidth function estimates an appropriate bandwidth for mean shift clustering.

Syntax:

sklearn.cluster.estimate_bandwidth(X, quantile=0.3, n_samples=None, random_state=0)

I found that the estimated bandwidth increases as quantile increases, which results in fewer clusters. Similarly, decreasing quantile decreases the bandwidth and hence gives more clusters.
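To make the observation reproducible, here is a minimal sketch that sweeps quantile and records the estimated bandwidth and the resulting number of clusters. The synthetic data from make_blobs and the particular quantile values are assumptions for illustration, not the dataset from the question.

```python
import numpy as np
from sklearn.cluster import MeanShift, estimate_bandwidth
from sklearn.datasets import make_blobs

# Synthetic stand-in data (an assumption; replace with your own X).
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=0)

quantiles = [0.05, 0.1, 0.2, 0.3, 0.5]  # illustrative values only
bandwidths = []
cluster_counts = []

for q in quantiles:
    # Estimate the bandwidth for this quantile, then cluster with it.
    bw = estimate_bandwidth(X, quantile=q, random_state=0)
    ms = MeanShift(bandwidth=bw).fit(X)

    bandwidths.append(bw)
    cluster_counts.append(len(np.unique(ms.labels_)))
    print(f"quantile={q:.2f}  bandwidth={bw:.3f}  clusters={cluster_counts[-1]}")
```

On data like this the bandwidth should grow with quantile; the cluster count usually shrinks, though it need not change at every step.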

So it seems the number of clusters depends on the quantile value chosen.

How do I choose the optimum quantile?

1
Luck and experience. Unfortunately. But what is the “optimum” anyway? - Has QUIT--Anony-Mousse
"Optimum" in the sense that the clusters are stable. - gruangly
Then infinity would be optimal in that sense. - Has QUIT--Anony-Mousse

1 Answer

0
votes

The quantile is used by the nearest-neighbor search that runs inside the estimate_bandwidth function to determine the bandwidth.
Concretely:

n = number of neighbors = int(number of samples * quantile)

The bandwidth is then the average, over all samples, of the distance from each sample to the farthest of its n nearest neighbors. So you can use this to reason about how to set the quantile (and hence the bandwidth): a window of the returned bandwidth will, on average, cover about n samples, which strongly affects the number of clusters that Mean Shift returns.
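The calculation described above can be reproduced by hand as a rough check. This is a sketch based on my reading of the scikit-learn source, so treat the exact equivalence as an assumption that may vary between versions. The data from make_blobs and the names manual_bandwidth and library_bandwidth are illustrative only.

```python
import numpy as np
from sklearn.cluster import estimate_bandwidth
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors

# Synthetic stand-in data (an assumption; replace with your own X).
X, _ = make_blobs(n_samples=200, centers=3, random_state=0)
quantile = 0.3

# Number of neighbors implied by the quantile (at least 1).
n_neighbors = max(int(X.shape[0] * quantile), 1)

# For every sample, find its n nearest neighbors in X.
# Querying with X itself means each sample counts as its own
# nearest neighbor at distance 0.
nbrs = NearestNeighbors(n_neighbors=n_neighbors).fit(X)
distances, _ = nbrs.kneighbors(X)

# Distance to the farthest of the n neighbors, averaged over all samples.
manual_bandwidth = distances.max(axis=1).mean()

library_bandwidth = estimate_bandwidth(X, quantile=quantile)
print(manual_bandwidth, library_bandwidth)
```

If the two numbers agree on your installation, the quantile can be read as "what fraction of the data should fall inside a typical window".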