I am attempting to locate clusters of like objects. I have computed a value for each object-to-object comparison and created a matrix of the form:
header = [1, 2, 3, 4, 5]
matrix = [[0, 100, 0, 0, 0],
          [100, 0, 0, 0, 0],
          [0, 0, 0, 0, 0],
          [0, 0, 0, 0, 0],
          [0, 0, 0, 0, 0]]
I pass the matrix to the sklearn Affinity Propagation module:
import numpy as np
from sklearn.cluster import AffinityPropagation
matrix = np.array(matrix)
cluster = AffinityPropagation(preference="precomputed")
cls = cluster.fit_predict(matrix)
In the example given, I would expect 1 and 2 to be clustered, since the 1-2 and 2-1 entries are 100 and all other values are zero. But the cls array does not reflect this:
cls = [0 0 0 0 1]
This indicates that 1, 2, 3, and 4 form one cluster and 5 is a separate cluster.
I've tried passing only the upper-right triangular matrix, varying the value magnitude (i.e., 0-1 instead of 0-100), and so on, but it still does not cluster as expected.
Thoughts on what I am missing?
ADDITIONAL INFO 10/24/2014:
I am performing a pairwise comparison of my objects, and from that I generate a number that indicates how strongly each object relates to every other. Many of these objects do not relate at all, so they result in a "0" value.
This creates a sparse n-by-n matrix, where n is on the order of 10s to 100s of objects.
Visually, it is trivial for me to "cluster" these objects for further analysis. In the case below, 1 relates to 2 and 2 relates to 3, but 1 and 3 do not DIRECTLY relate. I would continue processing with 1, 2, and 3, and ignore 4 and 5. (In my actual data, I would likely have multiple valid clusters within a single matrix.)
header = [1, 2, 3, 4, 5]
matrix = [[0, 100, 0, 0, 0],
          [100, 0, 96, 0, 0],
          [0, 96, 0, 0, 0],
          [0, 0, 0, 0, 0],
          [0, 0, 0, 0, 0]]
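To make the expected result concrete, here is a minimal sketch of the grouping I do by eye. It assumes that any nonzero score counts as a link and that a "cluster" is a set of objects connected through such links, directly or indirectly. Using scipy's connected_components is only my way of illustrating the desired output; it is not part of the Affinity Propagation code above, and the names groups and clusters are just for this example.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

header = [1, 2, 3, 4, 5]
matrix = np.array([
    [0, 100, 0, 0, 0],
    [100, 0, 96, 0, 0],
    [0, 96, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
])

# Assumption: every nonzero score is an undirected link between two objects.
n_components, labels = connected_components(csr_matrix(matrix), directed=False)

# Group object names by component label.
groups = {}
for name, label in zip(header, labels):
    groups.setdefault(int(label), []).append(name)

# Keep only groups with more than one member; isolated objects are ignored.
clusters = [members for members in groups.values() if len(members) > 1]
print(clusters)
```

For the matrix above this keeps 1, 2, and 3 together and drops 4 and 5, which is the result I am hoping a clustering method would reproduce.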
My research indicates that Affinity Propagation is good at finding clusters in sparse matrices, and my pairwise comparison is effectively generating a "precomputed" affinity matrix.
While it is easy to find these clusters visually, I would like to automate it so I can integrate it with the code that comes before and after. However, as the original post indicates, I am not generating meaningful clusters.
The question:
Is some kind of processing required to generate meaningful clusters starting with the kind of matrix I have described?
Am I neglecting a step or otherwise inserting an error into the algorithm such that it fails to find my clusters?
Should I be using a different clustering method (DBSCAN, k-means, etc.) on this kind of data?