I am attempting to locate clusters of like objects. I have computed a value for each object-to-object comparison and created a matrix of the form:

header =  [1, 2, 3, 4, 5]
matrix = [[0, 100, 0, 0, 0],
          [100, 0, 0, 0, 0],
          [0, 0, 0, 0, 0],
          [0, 0, 0, 0, 0],
          [0, 0, 0, 0, 0]]

I pass the matrix to the sklearn Affinity Propagation module:

import numpy as np
from sklearn.cluster import AffinityPropagation

matrix = np.array(matrix)
cluster = AffinityPropagation(preference="precomputed")
cls = cluster.fit_predict(matrix)

In the example given, I would expect 1 and 2 to be clustered, as 1-2 / 2-1 is 100, and all other values are zero. But the cls array does not reflect this:

cls = [0 0 0 0 1]

This indicates that 1, 2, 3, and 4 form one cluster, and 5 is a separate cluster.

I've tried passing only the upper-right triangular matrix, varying the magnitude of the values (i.e. 0-1 instead of 0-100), and so on, and it still does not cluster as expected.

Thoughts on what I am missing?

ADDITIONAL INFO 10/24/2014:

I am performing a pairwise comparison of my objects, and from that I generate a number that indicates how well each object relates to every other object. Many of these objects do not relate at all, so they result in a "0" value.

This creates a sparse n-by-n matrix, where n is on the order of 10s to 100s of objects.

Visually, it is trivial for me to "cluster" these objects for further analysis. In the case below, 1 relates to 2, and 2 relates to 3, but 1 and 3 do not DIRECTLY relate. I would continue processing with 1, 2, and 3, and ignore 4 and 5. (In my actual data, I would likely have multiple valid clusters within a single matrix.)

header =  [1,   2,   3,   4,   5]
matrix = [[0,  100,  0,   0,   0],
          [100, 0,  96,   0,   0],
          [0,  96,   0,   0,   0],
          [0,   0,   0,   0,   0],
          [0,   0,   0,   0,   0]]

My research indicates that Affinity Propagation is good at finding clusters in sparse matrices, and my pairwise comparison is effectively generating a "precomputed" affinity matrix.

While it is easy to find these clusters visually, I would like to automate it so I can integrate it with the code that comes before and after. However, as the original post indicates, I am not generating meaningful clusters.

The question:

Is some kind of processing required to generate meaningful clusters starting with the kind of matrix I have described?

Am I neglecting a step or otherwise inserting an error into the algorithm such that it fails to find my clusters?

Should I be using a different clustering method (DBSCAN, k-means, etc) on this kind of data?

1 Answer

0 is not a magic "do not link" value.

Since objects 3 and 4 have the same affinity to 1, 2, and 5, it does not matter which of them they are assigned to; all of these assignments are of roughly the same quality.
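To make that tie concrete, here is a minimal check on the first matrix from the question. The helper name `affinities_to_others` is made up for illustration; it only reads values out of the matrix.

```python
import numpy as np

# First example matrix from the question (rows/columns are objects 1..5).
S = np.array([
    [0, 100, 0, 0, 0],
    [100, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
])

def affinities_to_others(S, obj, candidates):
    """Return the affinities of 1-based object `obj` to each 1-based candidate."""
    return [int(S[obj - 1, c - 1]) for c in candidates]

for obj in (3, 4):
    # Prints "3 [0, 0, 0]" and "4 [0, 0, 0]": no candidate is preferred.
    print(obj, affinities_to_others(S, obj, candidates=(1, 2, 5)))
```

Every candidate scores the same, so nothing in the data tells the algorithm to keep 3 and 4 away from 1 and 2.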

The stronger cohesion of 1 and 2 may make it preferable to assign 3 and 4 there, and the desire to produce more than one cluster may leave 5 separate. But it may also just be random: objects 3 and 4 get assigned to the first exemplar with the best affinity (which comes from the cluster of 1 and 2), and object 5 is kept separate simply to have at least two components.
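One way to see how arbitrary the assignment is would be to rerun the clustering with different `preference` values. The sketch below assumes a reasonably recent scikit-learn, where the precomputed matrix is selected with `affinity="precomputed"` and `preference` is a separate numeric parameter that is placed on the diagonal and influences how many exemplars are chosen. The preference values tried here are arbitrary, and the helper name `cluster_labels` is made up. No output is shown because the labels on such degenerate input can vary, and the fit may emit a convergence warning.

```python
import numpy as np
from sklearn.cluster import AffinityPropagation

# Second example matrix from the question, as floats.
S = np.array([
    [0, 100, 0, 0, 0],
    [100, 0, 96, 0, 0],
    [0, 96, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
], dtype=float)

def cluster_labels(S, preference=None):
    """Fit Affinity Propagation on a precomputed affinity matrix S."""
    model = AffinityPropagation(
        affinity="precomputed",  # S holds similarities, not feature vectors
        preference=preference,   # None lets scikit-learn use the median of S
        random_state=0,          # fixed seed so repeated runs are comparable
    )
    return model.fit_predict(S)

# Arbitrary preference values, chosen only for illustration.
for preference in (None, -50.0, 50.0):
    print(preference, cluster_labels(S, preference))
```

If the labels for 3, 4, and 5 change as the preference changes, that suggests the data itself does not determine where they belong.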

Use real data, not hand-crafted affinities.
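If a zero really is meant as "no link", as in the matrices in the question, one option might be to skip Affinity Propagation and treat the matrix as a graph instead: objects are nodes, nonzero affinities are edges, and each connected component is a group. This is a different technique from the one asked about, sketched here under the assumption that any affinity above a cut-off counts as a link. The function name `linked_groups` and the `threshold` parameter are hypothetical.

```python
from collections import deque

def linked_groups(matrix, threshold=0):
    """Group objects connected by any chain of affinities above `threshold`.

    Affinities <= threshold are treated as "no link".
    Returns a list of groups of 0-based indices.
    """
    n = len(matrix)
    seen = [False] * n
    groups = []
    for start in range(n):
        if seen[start]:
            continue
        seen[start] = True
        group, queue = [], deque([start])
        while queue:  # breadth-first search from `start`
            i = queue.popleft()
            group.append(i)
            for j in range(n):
                linked = matrix[i][j] > threshold or matrix[j][i] > threshold
                if not seen[j] and linked:
                    seen[j] = True
                    queue.append(j)
        groups.append(sorted(group))
    return groups

header = [1, 2, 3, 4, 5]
matrix = [
    [0, 100, 0, 0, 0],
    [100, 0, 96, 0, 0],
    [0, 96, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
]

# Keep only groups with more than one member; singletons are unlinked objects.
clusters = [[header[i] for i in g] for g in linked_groups(matrix) if len(g) > 1]
print(clusters)  # [[1, 2, 3]]
```

On the second matrix from the question this groups 1, 2, and 3 through the chain 1-2-3 and leaves 4 and 5 as singletons. Raising `threshold` would drop weaker links, which may matter on real data where few affinities are exactly zero.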