I have a relatively large NumPy array (nearly 300k rows and 20+ columns, though most values are 0) for which I need to compute a distance matrix using scikit-learn's pairwise_distances function.
Unfortunately, this process runs into a memory error unless I convert the input array to a sparse matrix. SciPy offers many sparse matrix classes and I do not know which one is best for this particular situation.
I found a Stack Overflow answer that favors CSR or CSC, but I am unclear which of the two would be better for computing a distance matrix. Any suggestions are welcome!
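For reference, here is a minimal sketch of the setup described above, run on a small stand-in array. The random data, the 1,000-row size, and the Euclidean metric are assumptions made for illustration, not details of my actual data. CSR is used because it stores the data row by row, which matches a samples-by-features layout.

```python
import numpy as np
from scipy import sparse
from sklearn.metrics import pairwise_distances

# Hypothetical stand-in for the real array: 1,000 rows, 20 columns, mostly zeros.
rng = np.random.default_rng(0)
X = rng.random((1000, 20))
X[X < 0.9] = 0.0

# Convert the dense array to CSR (row-oriented sparse storage).
X_csr = sparse.csr_matrix(X)

# pairwise_distances accepts sparse input for the Euclidean metric,
# but the result is still a dense (n_samples, n_samples) array.
D = pairwise_distances(X_csr, metric="euclidean")
print(D.shape)
```

Note that only the input is sparse here. The returned distance matrix is dense, so its size grows with the square of the number of rows regardless of the input format.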
n choose 2 entries, which (for n = 300,000) most certainly won't fit into memory. So converting the input array to a sparse array won't help much, I think. - jme
silhouette_score function, which evaluates a clustering solution. I precompute the distance matrix because pairwise_distances can be parallelized, whereas silhouette_score, which computes a distance matrix in the background, cannot. - Gyan Veda
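The "n choose 2" point in jme's comment can be checked with a quick back-of-the-envelope calculation. This sketch assumes 8 bytes per distance (float64), which is an assumption about the output dtype rather than something stated in the thread.

```python
import math

n = 300_000          # number of rows, as stated in the question
bytes_per_entry = 8  # assumption: each distance is stored as float64

# Number of distinct unordered pairs of rows (n choose 2).
n_pairs = math.comb(n, 2)

# Memory for a condensed matrix (one entry per pair) versus a full n x n matrix.
condensed_bytes = n_pairs * bytes_per_entry
full_bytes = n * n * bytes_per_entry

print(f"pairs: {n_pairs:,}")
print(f"condensed: {condensed_bytes / 1e9:.0f} GB")
print(f"full square: {full_bytes / 1e9:.0f} GB")
```

Under that assumption, even the condensed form needs roughly 360 GB and the full square matrix roughly 720 GB, which is why a sparse input alone does not avoid the memory error.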