6
votes

I am using Python with the numpy, scipy and scikit-learn modules.

I'd like to classify the arrays in a very big sparse matrix (100,000 × 100,000).

The values in the matrix are 0 or 1. All I have are the indices where the value is 1:

a = [1,3,5,7,9] 
b = [2,4,6,8,10]

which means

a = [0,1,0,1,0,1,0,1,0,1,0]
b = [0,0,1,0,1,0,1,0,1,0,1]
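
As a minimal sketch (assuming plain numpy), the dense form can be rebuilt from an index list like this:

import numpy as np
a_idx = [1, 3, 5, 7, 9]            # indices where the value is 1
dense = np.zeros(11, dtype=int)    # start from an all-zero row of length 11
dense[a_idx] = 1                   # -> [0 1 0 1 0 1 0 1 0 1 0]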

How can I convert these index arrays to a sparse array in scipy?

How can I classify those arrays quickly?

Thank you very much.

This makes me wonder: would it be possible to just make the entire matrix non-sparse, since all the values are 0 or 1 anyway? Instead of 64 bits per float or so, you would only use one bit each. (I know this doesn't solve your problem, but your question made me come up with this one.) – usethedeathstar
What kind of similarity do you want to compute? Why do you need the sparse matrix, instead of just using the indices? How about something simple like len(set(a) & set(b)) / float(len(a))? (See the sketch after these comments.) – w-m
Actually, I'd like to group those arrays by similarity. For example, [1,1,1,0] is more like [1,1,0,0] but the inverse of [0,0,0,1]. Since the numbers of columns and rows are large, I don't know whether there is any method that could do it quickly. – Jimmy Lin
How many groups are you trying for – 10 × 10k, 100 × 1k? Have you looked through scikit-learn clustering? – denis
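
A minimal sketch of w-m's set-overlap suggestion, assuming the rows are kept as plain Python index lists (a and b as in the question):

a = [1, 3, 5, 7, 9]
b = [2, 4, 6, 8, 10]
# fraction of a's nonzero positions that also appear in b
similarity = len(set(a) & set(b)) / float(len(a))   # 0.0 here: no shared index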

1 Answer

4
votes

If you choose the sparse coo_matrix, you can create it by passing the indices like this:

from scipy.sparse import coo_matrix
import numpy as np

nrows = 100000
ncols = 100000
row = np.array([1, 3, 5, 7, 9])       # row indices of the nonzero entries
col = np.array([2, 4, 6, 8, 10])      # column indices of the nonzero entries
values = np.ones(col.size)            # every stored entry is 1
m = coo_matrix((values, (row, col)), shape=(nrows, ncols), dtype=float)
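
The question also asks how to group the arrays quickly. As a hedged sketch, scikit-learn's clustering estimators such as MiniBatchKMeans accept sparse input, so one option is to convert the COO matrix to CSR and cluster the rows directly (n_clusters=10 is an arbitrary assumption, per denis's question about the number of groups):

from sklearn.cluster import MiniBatchKMeans

X = m.tocsr()                                         # CSR supports the row operations COO lacks
km = MiniBatchKMeans(n_clusters=10, random_state=0)   # n_clusters is a guess
labels = km.fit_predict(X)                            # one cluster label per row

CSR also makes per-row similarity computations feasible without densifying, e.g. sklearn.metrics.pairwise.cosine_similarity works directly on sparse matrices.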