6
votes

I am using Python with the numpy, scipy and scikit-learn modules.

I'd like to classify the arrays in a very big sparse matrix (100,000 × 100,000).

The values in the matrix are 0 or 1. All I have are the indices where the value is 1:

a = [1,3,5,7,9] 
b = [2,4,6,8,10]

which means

a = [0,1,0,1,0,1,0,1,0,1,0]
b = [0,0,1,0,1,0,1,0,1,0,1]
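
As a minimal sketch (assuming plain numpy), the dense form can be rebuilt from an index list like this:

import numpy as np
a_idx = [1, 3, 5, 7, 9]            # indices where the value is 1
dense = np.zeros(11, dtype=int)    # start from an all-zero row of length 11
dense[a_idx] = 1                   # -> [0 1 0 1 0 1 0 1 0 1 0]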

How can I convert these index arrays to a sparse array in scipy?

How can I classify those arrays quickly?

Thank you very much.

This makes me wonder: would it be possible to just make the entire matrix non-sparse, since all the values are 0 or 1 anyway? Instead of 64 bits per float or so, you would only use one bit each. (I know this doesn't solve your problem, but your question made me come up with this one.) – usethedeathstar
What kind of similarity do you want to compute? Why do you need the sparse matrix, instead of just using the indices? How about something simple like len(set(a) & set(b)) / float(len(a))? (See the sketch after these comments.) – w-m
Actually, I'd like to group those arrays by similarity. For example, [1,1,1,0] is more like [1,1,0,0] but the inverse of [0,0,0,1]. Since the numbers of columns and rows are large, I don't know whether there is any method that could do it quickly. – Jimmy Lin
How many groups are you trying for – 10 × 10k, 100 × 1k? Have you looked through scikit-learn clustering? – denis
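
A minimal sketch of w-m's set-overlap suggestion, assuming the rows are kept as plain Python index lists (a and b as in the question):

a = [1, 3, 5, 7, 9]
b = [2, 4, 6, 8, 10]
# fraction of a's nonzero positions that also appear in b
similarity = len(set(a) & set(b)) / float(len(a))   # 0.0 here: no shared index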

1 Answer

4
votes

If you choose the sparse coo_matrix, you can create it by passing the indices like this:

from scipy.sparse import coo_matrix
import numpy as np

nrows = 100000
ncols = 100000
row = np.array([1, 3, 5, 7, 9])       # row indices of the nonzero entries
col = np.array([2, 4, 6, 8, 10])      # column indices of the nonzero entries
values = np.ones(col.size)            # every stored entry is 1
m = coo_matrix((values, (row, col)), shape=(nrows, ncols), dtype=float)
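
The question also asks how to group the arrays quickly. As a hedged sketch, scikit-learn's clustering estimators such as MiniBatchKMeans accept sparse input, so one option is to convert the COO matrix to CSR and cluster the rows directly (n_clusters=10 is an arbitrary assumption, per denis's question about the number of groups):

from sklearn.cluster import MiniBatchKMeans

X = m.tocsr()                                         # CSR supports the row operations COO lacks
km = MiniBatchKMeans(n_clusters=10, random_state=0)   # n_clusters is a guess
labels = km.fit_predict(X)                            # one cluster label per row

CSR also makes per-row similarity computations feasible without densifying, e.g. sklearn.metrics.pairwise.cosine_similarity works directly on sparse matrices.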