I wrote my own Shared Nearest Neighbor (SNN) clustering algorithm, following the original paper. Essentially, I find the nearest neighbors of each data point, precompute a distance matrix using the Jaccard distance between neighbor lists, and pass that distance matrix to DBSCAN.
To speed this up, I only compute the Jaccard distance between two data points if they are nearest neighbors of each other and share more than a certain number of neighbors. I also take advantage of the symmetry of the distance matrix and only compute half of it.
However, my algorithm is slow and takes much longer than common clustering algorithms such as K-Means or DBSCAN. Could someone look at my code and suggest how I can make it faster?
import numpy as np
from scipy.sparse import lil_matrix
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors


def jaccard(a, b):
    """
    Compute the Jaccard distance between two arrays of neighbor indices.

    Parameters
    ----------
    a : array-like of int
        Neighbor indices; entries of -1 are ignored.
    b : array-like of int
        Neighbor indices; entries of -1 are ignored.
    """
    A = np.asarray(a, dtype=int)
    B = np.asarray(b, dtype=int)
    A = A[A > -1]  # drop the -1 placeholder entries
    B = B[B > -1]
    union = np.union1d(A, B)
    intersection = np.intersect1d(A, B)
    return 1.0 - len(intersection) / len(union)
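# Quick sanity check on hypothetical toy inputs (the -1 entry is ignored):
# {0, 1, 2} vs {1, 2, 3, 4} -> intersection 2, union 5 -> 1 - 2/5 = 0.6.
print(jaccard([0, 1, 2, -1], [1, 2, 3, 4]))  # 0.6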
def iterator_dist(indices, k_min=5):
    """
    Iterate over pairs of stars that share enough nearest neighbors and
    yield (n, m, Jaccard distance) tuples.

    Parameters
    ----------
    indices : array of shape (n_points, n_neighbors)
        Indices of each star's nearest neighbors in the chemistry-velocity
        space.
    k_min : int
        Minimum number of shared neighbors a pair must have before its
        Jaccard distance is computed.
    """
    for n in range(len(indices)):
        # Only look at neighbors m > n: the matrix is symmetric, so each
        # pair only needs to be computed once.
        for m in indices[n][indices[n] > n]:
            if len(np.intersect1d(indices[n], indices[m])) > k_min:
                dist = jaccard(indices[n], indices[m])
                yield (n, m, dist)
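# Illustrative only, with a hypothetical hand-made neighbor table: 4 points,
# 3 neighbors each, k_min=1. Each qualifying pair (n, m) is yielded once:
toy_indices = np.array([[1, 2, 3],
                        [0, 2, 3],
                        [0, 1, 3],
                        [0, 1, 2]])
for n, m, d in iterator_dist(toy_indices, k_min=1):
    print(n, m, d)  # six pairs, each with distance 0.5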
# load data here
data = ...

# hyperparameters
n_neighbors = ...
eps = ...
min_samples = ...
k_min = ...
# K nearest neighbors; kneighbors() without arguments excludes each point
# from its own neighbor list.
nbrs = NearestNeighbors(n_neighbors=n_neighbors).fit(data)
distances, indices = nbrs.kneighbors()

# Distance matrix; LIL format is efficient for incremental construction.
S = lil_matrix((len(distances), len(distances)))
for (n, m, dist) in iterator_dist(indices, k_min):
    S[n, m] = dist
    S[m, n] = dist

# DBSCAN on the precomputed sparse distances. Entries that were never
# stored are treated as non-neighbors, not as zero distance.
db = DBSCAN(eps=eps, min_samples=min_samples, metric='precomputed',
            n_jobs=-1).fit(S)
labels = db.labels_
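In case it helps, this is roughly how I summarize the result (DBSCAN marks noise points with the label -1):

n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int(np.sum(labels == -1))
print(f"{n_clusters} clusters, {n_noise} noise points")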