2
votes

I have a relatively large NumPy array (nearly 300k rows and 20+ columns, though most values are 0) for which I need to compute a distance matrix using scikit-learn's pairwise_distances function.

Unfortunately, this process runs into a memory error unless I convert the input array to a sparse matrix. SciPy offers many sparse matrix classes and I do not know which one is best for this particular situation.

I found an SO answer that favors CSR or CSC, but I am unclear which one would be best to compute a distance matrix. Any suggestions are welcome!

1
A distance matrix isn't sparse. Well, I suppose it could be sparse if you have a lot of duplicate points, but that is rarely the case. - jme
The input array, not the distance matrix, is the one that I want to transform to a sparse matrix. - Gyan Veda
Ah, I see. But even then, the resulting distance matrix will have n choose 2 entries, which (for n=300,000) most certainly won't fit into memory. So converting the input array to a sparse array won't help much, I think. - jme
If you want to compute statistics over the pairwise distances, it might not make sense to keep the entire array in memory anyway. What are you doing with this matrix? - Hooked
The distance matrix will be an input to scikit-learn's silhouette_score function, which evaluates a clustering solution. I precompute the distance matrix because pairwise_distances can be parallelized, whereas silhouette_score, which computes a distance matrix in the background, cannot. - Gyan Veda

1 Answer

1
votes

CSR stores data row by row, while CSC stores it column by column, so accessing rows is faster with CSR and accessing columns is faster with CSC. Since sklearn.metrics.pairwise.pairwise_distances takes an input X whose rows are samples and whose columns are features, it accesses the sparse matrix by rows. Hence CSR is likely the more efficient choice.
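As a minimal sketch of the workflow described in the comments (convert the input to CSR, compute the distance matrix in parallel, then pass it to silhouette_score with metric='precomputed'), something like the following should work. The array sizes, sparsity, and cluster labels here are placeholders, not the asker's actual data, and a 300k x 300k distance matrix will still be dense and very large, as noted above.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.metrics import pairwise_distances, silhouette_score

# Placeholder data: a mostly-zero array and some cluster labels
# (replace with your own array and clustering result).
X_dense = np.random.binomial(1, 0.05, size=(1000, 25)).astype(float)
labels = np.random.randint(0, 5, size=X_dense.shape[0])

# Convert to CSR: pairwise_distances works sample-by-sample,
# so row-ordered storage is the natural fit.
X_sparse = csr_matrix(X_dense)

# Compute the distance matrix in parallel across all cores.
# The result is a dense (n_samples, n_samples) array.
D = pairwise_distances(X_sparse, metric='euclidean', n_jobs=-1)

# Evaluate the clustering using the precomputed distances.
score = silhouette_score(D, labels, metric='precomputed')
print(score)
```

Note that sparse input mainly helps pairwise_distances itself; the output distance matrix is always dense, so for very large n it may still be necessary to compute the silhouette on a subsample.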