I've had encouraging results clustering a set of entity names using scikit-learn's affinity propagation implementation, with a modified Jaro-Winkler distance as the similarity metric, but my clusters are still too numerous (i.e., too many false positives).

I see in the scikit-learn documentation that there exists a 'preference' parameter that affects the number of clusters, with the following description:

preference : array-like, shape (n_samples,) or float, optional

Preferences for each point - points with larger values of preferences are more likely to be chosen as exemplars. The number of exemplars, ie of clusters, is influenced by the input preferences value. If the preferences are not passed as arguments, they will be set to the median of the input similarities.[0]

However, when I began tinkering with this value, I found that a very narrow range of values was giving me either too many clusters (preference=-11.13) or too few clusters (preference=-11.11).

Is there some way to determine what a 'reasonable' value of the preference parameter should be? And why would it be that I'm unable to obtain a non-extreme number of clusters?

Similar questions:

Affinity Propagation - Cluster Imbalance

Affinity Propagation preferences initialization

Don't overfit parameters! - Has QUIT--Anony-Mousse
I know! In this case we're manually reviewing the output so we have a good sense for what 'correct' should look like. - nitrl
I have tried playing with mean and min (of point similarities) and functions thereof to obtain a decent preference. I am still struggling to find a method that actually works in practice. - Apollys supports Monica
Did you find a good way to choose a better hyperparameter without overfitting the model? - cloudscomputes

1 Answer

You could try using sklearn.model_selection.GridSearchCV or sklearn.model_selection.RandomizedSearchCV.

You could define a custom error measure that encourages the hyperparameter search to produce fewer clusters. Then you can search over several values and pick the one that works best for your dataset, judged on a validation set.
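As a rough illustration of the search idea, here is a minimal sketch that uses a plain loop rather than GridSearchCV, since affinity propagation is unsupervised and a precomputed similarity matrix does not split cleanly into folds. Everything in it is an assumption for illustration: the synthetic data from make_blobs stands in for your name-similarity matrix, sweep_preference is a hypothetical helper, and the damping value and the candidate range (minimum to median of the off-diagonal similarities) are just starting points to adjust.

```python
import numpy as np
from sklearn.cluster import AffinityPropagation
from sklearn.datasets import make_blobs
from sklearn.metrics import pairwise_distances


def sweep_preference(similarity, preferences, random_state=0):
    """Fit affinity propagation once per preference value.

    Returns a list of (preference, n_clusters) pairs. A count of 0
    means the run did not converge for that preference.
    """
    results = []
    for pref in preferences:
        model = AffinityPropagation(
            affinity="precomputed",  # `similarity` is already an n x n matrix
            preference=pref,
            damping=0.9,             # assumed value; higher damping reduces oscillation
            max_iter=1000,
            random_state=random_state,
        )
        model.fit(similarity)
        n_clusters = len(model.cluster_centers_indices_)
        results.append((float(pref), n_clusters))
    return results


# Stand-in data (assumption): replace `similarity` with your own matrix,
# e.g. pairwise Jaro-Winkler similarities between entity names.
X, _ = make_blobs(n_samples=60, centers=3, cluster_std=0.5, random_state=0)
similarity = -pairwise_distances(X, metric="sqeuclidean")

# Candidate preferences between the minimum and the median similarity,
# ignoring the diagonal (self-similarity).
off_diag = similarity[~np.eye(len(similarity), dtype=bool)]
candidates = np.linspace(off_diag.min(), np.median(off_diag), num=5)

results = sweep_preference(similarity, candidates)
for pref, n_clusters in results:
    print(f"preference={pref:.2f} -> {n_clusters} clusters")
```

Looking at how the cluster count changes across the candidates shows where the abrupt jump sits. If the count flips between extremes over a tiny interval, it may be worth checking whether those runs converged at all before trusting the numbers.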

More info: http://scikit-learn.org/stable/modules/grid_search.html