How to choose an appropriate quantile value when estimating the bandwidth for MeanShift in Python?

Question

I am performing mean shift clustering on a dataset. The estimate_bandwidth function estimates an appropriate bandwidth for mean-shift clustering.

Syntax:

sklearn.cluster.estimate_bandwidth(X, quantile=0.3, n_samples=None, random_state=0)

I found that the estimated bandwidth increases as the quantile increases, which results in fewer clusters. Conversely, decreasing the quantile decreases the bandwidth and yields more clusters.

So it seems the number of clusters depends on the quantile value chosen.

How do I choose the optimum quantile?
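
The trend can be reproduced with a small sketch. The data here is synthetic (make_blobs), and the quantile grid and variable names are illustrative assumptions rather than part of the original setup. The bandwidth should not decrease as the quantile grows; the cluster count usually falls, but that part depends on the data.

```python
import numpy as np
from sklearn.cluster import MeanShift, estimate_bandwidth
from sklearn.datasets import make_blobs

# Synthetic stand-in data: three blobs (an assumption for illustration).
X, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [5, 5], [0, 5]],
    cluster_std=0.6,
    random_state=0,
)

quantiles = [0.05, 0.1, 0.2, 0.3, 0.5]
bandwidths = []
cluster_counts = []

for q in quantiles:
    # Estimate a bandwidth for this quantile, then cluster with it.
    bw = estimate_bandwidth(X, quantile=q, random_state=0)
    labels = MeanShift(bandwidth=bw).fit(X).labels_
    bandwidths.append(bw)
    cluster_counts.append(len(np.unique(labels)))
    print(f"quantile={q:.2f}  bandwidth={bw:.3f}  clusters={cluster_counts[-1]}")
```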

Luck and experience. Unfortunately. But what is the “optimum” anyway? — Has QUIT--Anony-Mousse

Hyperloop · Accepted Answer · 2019-04-04T21:59:11

The quantile sets the number of neighbors in the nearest-neighbor search that the estimate_bandwidth function runs internally to determine the bandwidth.
Concretely:

n = number of neighbors = int(number of samples used * quantile), with a minimum of 1

Here "samples used" is all of X, or a random subsample of size n_samples if that argument is given. The bandwidth is then the average, over those samples, of the distance from each sample to the farthest of its n nearest neighbors. It is not an average of pairwise distances within a cluster, since no clustering has happened at this point. So you can use this to reason about how to set the quantile: a kernel with the returned bandwidth will, on average, cover about n samples, which strongly affects the number of clusters that Mean Shift returns.
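
A minimal sketch of that computation, written by hand with NearestNeighbors and compared against the library result. This mirrors the scikit-learn source as I understand it, so treat the exact steps as an assumption to check against your installed version; the make_blobs data and the variable names are illustrative only.

```python
import numpy as np
from sklearn.cluster import estimate_bandwidth
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors

# Synthetic data, used only for illustration.
X, _ = make_blobs(n_samples=200, centers=3, random_state=0)
quantile = 0.3

# Number of neighbors implied by the quantile (at least 1).
n_neighbors = max(1, int(X.shape[0] * quantile))

# Distances from every sample to its n_neighbors nearest samples.
# Querying with X itself means each sample counts as its own
# first neighbor (distance 0).
nn = NearestNeighbors(n_neighbors=n_neighbors).fit(X)
distances, _ = nn.kneighbors(X)

# Bandwidth: mean over samples of the distance to the farthest neighbor.
manual_bandwidth = distances.max(axis=1).mean()
library_bandwidth = estimate_bandwidth(X, quantile=quantile)

print(f"n_neighbors={n_neighbors}")
print(f"manual={manual_bandwidth:.6f}  library={library_bandwidth:.6f}")
```

Seen this way, choosing a quantile amounts to deciding what fraction of the data a single kernel should span: a quantile near the expected share of the smallest cluster is a reasonable starting point, to be adjusted by inspecting the resulting clusters.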
