I wrote my own Shared Nearest Neighbor (SNN) clustering algorithm, following the original paper. Essentially, I find the nearest neighbors of each data point, precompute a distance matrix using the Jaccard distance between neighbor lists, and pass that distance matrix to DBSCAN.
To speed this up, I only compute the Jaccard distance between two data points if they are nearest neighbors of each other and share more than a certain number of neighbors. I also take advantage of the symmetry of the distance matrix and only compute half of it.
However, my algorithm is slow and takes much longer than common clustering algorithms such as K-Means or DBSCAN. Could someone look at my code and suggest how I can make it faster?
import numpy as np
from scipy.sparse import lil_matrix
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors


def jaccard(a, b):
    """
    Compute the Jaccard distance between two arrays of neighbor indices.

    Parameters
    ----------
    a : array-like of int
        Neighbor indices; entries of -1 are ignored.
    b : array-like of int
        Neighbor indices; entries of -1 are ignored.
    """
    A = np.asarray(a, dtype=int)
    B = np.asarray(b, dtype=int)
    A = A[A > -1]  # drop the -1 placeholder entries
    B = B[B > -1]
    union = np.union1d(A, B)
    intersection = np.intersect1d(A, B)
    return 1.0 - len(intersection) / len(union)
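# Quick sanity check on hypothetical toy inputs (the -1 entry is ignored):
# {0, 1, 2} vs {1, 2, 3, 4} -> intersection 2, union 5 -> 1 - 2/5 = 0.6.
print(jaccard([0, 1, 2, -1], [1, 2, 3, 4]))  # 0.6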
def iterator_dist(indices, k_min=5):
    """
    Iterate over pairs of stars that share enough nearest neighbors and
    yield (n, m, Jaccard distance) tuples.

    Parameters
    ----------
    indices : array of shape (n_points, n_neighbors)
        Indices of each star's nearest neighbors in the chemistry-velocity
        space.
    k_min : int
        Minimum number of shared neighbors a pair must have before its
        Jaccard distance is computed.
    """
    for n in range(len(indices)):
        # Only look at neighbors m > n: the matrix is symmetric, so each
        # pair only needs to be computed once.
        for m in indices[n][indices[n] > n]:
            if len(np.intersect1d(indices[n], indices[m])) > k_min:
                dist = jaccard(indices[n], indices[m])
                yield (n, m, dist)
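# Illustrative only, with a hypothetical hand-made neighbor table: 4 points,
# 3 neighbors each, k_min=1. Each qualifying pair (n, m) is yielded once:
toy_indices = np.array([[1, 2, 3],
                        [0, 2, 3],
                        [0, 1, 3],
                        [0, 1, 2]])
for n, m, d in iterator_dist(toy_indices, k_min=1):
    print(n, m, d)  # six pairs, each with distance 0.5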
# load data here
data = ...

# hyperparameters
n_neighbors = ...
eps = ...
min_samples = ...
k_min = ...
# K nearest neighbors; kneighbors() without arguments excludes each point
# from its own neighbor list.
nbrs = NearestNeighbors(n_neighbors=n_neighbors).fit(data)
distances, indices = nbrs.kneighbors()

# Distance matrix; LIL format is efficient for incremental construction.
S = lil_matrix((len(distances), len(distances)))
for (n, m, dist) in iterator_dist(indices, k_min):
    S[n, m] = dist
    S[m, n] = dist

# DBSCAN on the precomputed sparse distances. Entries that were never
# stored are treated as non-neighbors, not as zero distance.
db = DBSCAN(eps=eps, min_samples=min_samples, metric='precomputed',
            n_jobs=-1).fit(S)
labels = db.labels_
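In case it helps, this is roughly how I summarize the result (DBSCAN marks noise points with the label -1):

n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int(np.sum(labels == -1))
print(f"{n_clusters} clusters, {n_noise} noise points")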