2
votes

I have a really big (1.5M x 16M) sparse csr scipy matrix A. What i need to compute is the similarity of each pair of rows. I have defined the similarity as this:

Assume a and b are two rows of matrix A
a = (0, 1, 0, 4)
b = (1, 0, 2, 3)
Similarity (a, b) = 0*1 + 1*0 + 0*2 + 4*3 = 12

To compute all pairwise row similarities I use this (or Cosine similarity):

AT = np.transpose(A)
pairs = A.dot(AT)

Now pairs[i, j] is the similarity of row i and row j for all such i and j. This is quite similar to pairwise Cosine similarity of rows. So If there is an efficient parallel algorithm that computes pairwise Cosine similarity it would work for me as well.

The problem: This dot product is very slow because it uses just one cpu (I have access to 64 of those cpus on my server).

I can also export A and AT to a file and run any other external program that does the multiplication in parallel and get the results back to the Python program.

Is there any more efficient way of doing this dot product? or computing the pairwise similarity in Parallel?

2

2 Answers

4
votes

I finally used the 'Cosine' distance metric of scikit-learn and its pairwise_distances functions which support sparse matrices and is highly parallelised.

sklearn.metrics.pairwise.pairwise_distances(X, Y=None, metric='euclidean', n_jobs=1, **kwds)

I could also divide A into n horizontal parts and use the parallel python package to run multiple multiplications and horizontally stack the results later.

1
votes

I wrote own implementation using sklearn. It is not parallel but it quite fast for large matrices.

from scipy.sparse import spdiags
from sklearn.preprocessing import normalize

def get_similarity_by_x_dot_x_greedy_for_memory(sp_matrix):
    sp_matrix = sp_matrix.tocsr()
    matrix = sp_matrix.dot(sp_matrix.T)
    # zero diagonal
    diag = spdiags(-matrix.diagonal(), [0], *matrix.shape, format='csr')
    matrix = matrix + diag
    return matrix

def get_similarity_by_cosine(sp_matrix):
    sp_matrix = normalize(sp_matrix.tocsr())
    return get_similarity_by_x_dot_x_greedy_for_memory(sp_matrix)