2
votes

I am seeing something strange while using AffinityPropagation from sklearn. I have a 4 x 4 numpy ndarray of affinity scores: sim[i, j] holds the affinity score between samples i and j. When I feed it into the AffinityPropagation function, I get a total of 4 labels.

Here is a similar example with a smaller matrix:

In [215]: x = np.array([[1, 0.2, 0.4, 0], [0.2, 1, 0.8, 0.3], [0.4, 0.8, 1, 0.7], [0, 0.3, 0.7, 1]]
   .....: )

In [216]: x
Out[216]:
array([[ 1. ,  0.2,  0.4,  0. ],
       [ 0.2,  1. ,  0.8,  0.3],
       [ 0.4,  0.8,  1. ,  0.7],
       [ 0. ,  0.3,  0.7,  1. ]])

In [217]: clusterer = cluster.AffinityPropagation(affinity='precomputed')

In [218]: f = clusterer.fit(x)

In [219]: f.labels_
Out[219]: array([0, 1, 1, 1])

This says (according to Kevin) that the first sample (0th-indexed row) is a cluster (cluster #0) on its own and the rest of the samples are in another cluster (cluster #1). But still, I do not understand this output. What is a sample here? What are the members? I want to have a set of pairs (i, j) assigned to one cluster, another set of pairs assigned to another cluster, and so on.

It looks like it is being treated as a 4-sample x 4-feature matrix, which I do not want. Is this the problem? If so, how do I convert it to a proper 4-sample x 4-sample affinity matrix?

The documentation (http://scikit-learn.org/stable/modules/generated/sklearn.cluster.AffinityPropagation.html) says

fit(X, y=None)
Create affinity matrix from negative euclidean distances, then apply affinity propagation clustering.
Parameters: 
X: array-like, shape (n_samples, n_features) or (n_samples, n_samples) :
Data matrix or, if affinity is precomputed, matrix of similarities / affinities.

Thanks!


2 Answers

3
votes

By your description it sounds like you are working with a "pairwise similarity matrix": x (although your example data does not show that). If this is the case, your matrix should be symmetric, so that sim[i, j] == sim[j, i], with the diagonal values equal to 1. Example similarity data S:

S
array([[ 1.        ,  0.08276253,  0.16227766,  0.47213595,  0.64575131],
       [ 0.08276253,  1.        ,  0.56776436,  0.74456265,  0.09901951],
       [ 0.16227766,  0.56776436,  1.        ,  0.47722558,  0.58257569],
       [ 0.47213595,  0.74456265,  0.47722558,  1.        ,  0.87298335],
       [ 0.64575131,  0.09901951,  0.58257569,  0.87298335,  1.        ]])
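As a quick sanity check on that requirement, here is a minimal sketch (using the questioner's 4 x 4 matrix rather than S above) that verifies symmetry and a unit diagonal with NumPy:

```python
import numpy as np

# The questioner's 4 x 4 similarity matrix
S = np.array([[1.0, 0.2, 0.4, 0.0],
              [0.2, 1.0, 0.8, 0.3],
              [0.4, 0.8, 1.0, 0.7],
              [0.0, 0.3, 0.7, 1.0]])

# A valid pairwise similarity matrix is symmetric with ones on the diagonal
print(np.allclose(S, S.T))          # True
print(np.allclose(np.diag(S), 1.0)) # True
```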

Typically, when you already have a distance matrix, you should use affinity='precomputed'. But in your case you have similarities. In this specific example you can transform to a pseudo-distance using 1 - S. (The reason to do this is that I don't know whether Affinity Propagation will give you the expected results if you give it a similarity matrix as input):

1 - S
array([[ 0.        ,  0.91723747,  0.83772234,  0.52786405,  0.35424869],
       [ 0.91723747,  0.        ,  0.43223564,  0.25543735,  0.90098049],
       [ 0.83772234,  0.43223564,  0.        ,  0.52277442,  0.41742431],
       [ 0.52786405,  0.25543735,  0.52277442,  0.        ,  0.12701665],
       [ 0.35424869,  0.90098049,  0.41742431,  0.12701665,  0.        ]])

With that being said, I think this is where your interpretation was off:

This says that the first 3 rows are similar, the 4th row is a cluster on its own, and the 5th row is also a cluster on its own. A total of 3 clusters.

The f.labels_ array:

array([0, 1, 1, 1, 0])

is telling you that samples (not rows) 0 and 4 are in cluster 0, AND that samples 1, 2, and 3 are in cluster 1. You don't need 25 different labels for a 5-sample problem; that wouldn't make sense. Hope this helps a little. Try the demo (inspect the variables along the way and compare them with your data), which starts with raw data; it should help you decide whether Affinity Propagation is the right clustering algorithm for you.
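If you want explicit cluster membership rather than the raw label array, a small sketch that groups sample indices by their label (using the labels_ values from this answer):

```python
import numpy as np
from collections import defaultdict

labels = np.array([0, 1, 1, 1, 0])  # the f.labels_ array shown above

# Map each cluster label to the list of sample indices it contains
clusters = defaultdict(list)
for sample_idx, label in enumerate(labels):
    clusters[label].append(sample_idx)

print(dict(clusters))  # {0: [0, 4], 1: [1, 2, 3]}
```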

0
votes

According to this page (https://scikit-learn.org/stable/modules/clustering.html), you can use a similarity matrix with AffinityPropagation.
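As a minimal sketch of that usage (the raw data and the choice of negative squared euclidean distance as the similarity measure are illustrative assumptions, not taken from that page; random_state requires a newer scikit-learn version):

```python
import numpy as np
from sklearn.cluster import AffinityPropagation

# Hypothetical 1-D raw data: two tight groups of points
X = np.array([[0.0], [0.1], [5.0], [5.1]])

# A common similarity choice: negative squared euclidean distances
S = -((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)

clusterer = AffinityPropagation(affinity='precomputed', random_state=0)
labels = clusterer.fit(S).labels_
print(labels)  # one cluster label per sample
```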