I am computing cosine similarity between two large sets of vectors that share the same features. Each set is represented as a scipy CSR sparse matrix, A and B. I want to compute A x B^T, which will not be sparse; however, I only need to keep values exceeding some threshold, e.g. 0.8. I am trying to implement this in PySpark with vanilla RDDs, with the idea of using the fast vector operations implemented for scipy CSR matrices.
The rows of A and B are normalized, so to compute cosine similarity I simply need to find the dot product of each row from A with each row from B. The dimensions of A are 5,000,000 x 5,000. The dimensions of B are 2,000,000 x 5,000.
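
To make the per-block computation concrete, here is a minimal local sketch of what I mean by leaning on scipy's CSR operations (the helper name `block_similarities` and the toy data are mine, not from any library). Note that the sparse-times-sparse product is still returned as a sparse matrix object, so thresholding its `.data` array avoids materializing a dense result:

```python
import numpy as np
import scipy.sparse as sp

def block_similarities(a_block, b_block, threshold=0.8):
    # Rows are already L2-normalized, so a dot product is a cosine similarity.
    sims = a_block.dot(b_block.T).tocoo()   # sparse * sparse -> sparse result
    keep = sims.data >= threshold
    # (row index in a_block, row index in b_block, similarity) triples
    return list(zip(sims.row[keep], sims.col[keep], sims.data[keep]))

# Tiny usage example with made-up, pre-normalized rows:
a = sp.csr_matrix(np.array([[0.6, 0.8, 0.0], [1.0, 0.0, 0.0]]))
b = sp.csr_matrix(np.array([[0.6, 0.8, 0.0], [0.0, 0.0, 1.0]]))
print(block_similarities(a, b))   # -> [(0, 0, 1.0)] up to floating point
```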
Suppose A and B are too large to fit into memory on my worker nodes as broadcast variables. How should I approach parallelizing both A and B in an optimal way?
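
For context, the direction I have in mind is a block-partitioned cartesian product: split both matrices into row blocks, parallelize the blocks as two RDDs, and multiply every pair of blocks locally with scipy. A rough sketch follows (`block_similarities` is the helper above, `sc` is the SparkContext, and the 100,000-row block size is an arbitrary placeholder); it still builds the block lists on the driver, which is exactly the part I am unsure how to do optimally:

```python
def row_blocks(csr, block_size):
    """Yield (starting_global_row, csr_block) pairs from a local CSR matrix."""
    for start in range(0, csr.shape[0], block_size):
        yield start, csr[start:start + block_size]

a_blocks = sc.parallelize(list(row_blocks(A, 100000)))
b_blocks = sc.parallelize(list(row_blocks(B, 100000)))

# Each (A-block, B-block) pair is multiplied locally; only
# (global A row, global B row, similarity) triples above the threshold survive.
matches = (a_blocks.cartesian(b_blocks)
           .flatMap(lambda pair: [(pair[0][0] + i, pair[1][0] + j, v)
                                  for i, j, v in block_similarities(pair[0][1],
                                                                    pair[1][1],
                                                                    0.8)]))
```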
EDIT: Since posting my solution, I have been exploring other approaches that may be clearer and more efficient, in particular the columnSimilarities() method implemented for Spark MLlib IndexedRowMatrix objects. (Which pyspark abstraction is appropriate for my large matrix multiplication?)
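
From what I can tell, columnSimilarities() compares the columns of a single matrix rather than rows of two matrices, and when given a threshold it uses DIMSUM sampling, so results are approximate. A minimal usage sketch, assuming a hypothetical `vector_rdd` arranged so that the vectors I care about end up as columns (an IndexedRowMatrix could be converted via toRowMatrix()):

```python
from pyspark.mllib.linalg.distributed import RowMatrix

mat = RowMatrix(vector_rdd)

# With a threshold, Spark applies DIMSUM sampling: similarities above the
# threshold are estimated rather than computed exactly.
sims = mat.columnSimilarities(threshold=0.8)        # CoordinateMatrix
high = sims.entries.filter(lambda e: e.value >= 0.8)
```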