I'm trying to implement a function in NumPy/Scipy to compute Jensen-Shannon divergence between a single (training) vector and a large number of other (observation) vectors. The observation vectors are stored in a very large (500,000x65536) Scipy sparse matrix (a dense matrix won't fit into memory).
As part of the algorithm I need to compute T+Oi for each observation vector Oi, where T is the training vector. I wasn't able to find a way to do this using NumPy's usual broadcasting rules, since sparse matrices don't seem to support those (if T is left as a dense array, Scipy tries to make the sparse matrix dense first, which runs out of memory; if I make T into a sparse matrix, T+Oi fails because the shapes are inconsistent).
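A minimal reproduction of the two failure modes, with tiny toy arrays of my own standing in for the real 500,000x65536 data:

```python
import numpy as np
import scipy.sparse as sp

# Tiny stand-ins for the real data (the real matrix won't fit densely).
obs = sp.csr_matrix(np.array([[0., 2., 0., 1.],
                              [3., 0., 0., 0.],
                              [0., 0., 5., 0.]]))
T = np.array([1., 0., 0., 4.])

# Sparse + sparse with mismatched shapes is rejected outright:
try:
    sp.csr_matrix(T) + obs
    broadcasted = True
except ValueError:
    broadcasted = False
print("sparse broadcasting worked:", broadcasted)

# The only path that broadcasts is the dense one, which materializes
# the whole observation matrix in memory first:
dense_sum = obs.toarray() + T  # row i is Oi + T
print(dense_sum)
```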
Currently I am taking the grossly inefficient step of tiling the training vector into a 500,000x65536 sparse matrix:
import numpy as np
import scipy.sparse as sp

# Ensure CSR layout and a compact dtype for the training vector.
training = sp.csr_matrix(training.astype(np.float32))
# Every row holds the same number of nonzeros, so indptr is a uniform ramp.
tindptr = np.arange(0, len(training.indices)*observations.shape[0]+1, len(training.indices), dtype=np.int32)
# Repeat the column indices and values once per observation row.
tindices = np.tile(training.indices, observations.shape[0])
tdata = np.tile(training.data, observations.shape[0])
mtraining = sp.csr_matrix((tdata, tindices, tindptr), shape=observations.shape)
But this takes up a huge amount of memory (around 6GB), when it's only storing ~1500 "real" elements. It's also pretty slow to construct.
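At a small scale the construction can be sanity-checked with toy data of my own: every row of mtraining is a copy of the training vector, so mtraining + observations gives the T+Oi sums row by row:

```python
import numpy as np
import scipy.sparse as sp

# Toy stand-ins (2 observation rows, 3 columns) for the real data.
observations = sp.csr_matrix(np.array([[0., 1., 0.],
                                       [2., 0., 3.]]))
training = sp.csr_matrix(np.array([[5., 0., 7.]]).astype(np.float32))

# Same tiling construction as above, at toy scale.
tindptr = np.arange(0, len(training.indices)*observations.shape[0]+1,
                    len(training.indices), dtype=np.int32)
tindices = np.tile(training.indices, observations.shape[0])
tdata = np.tile(training.data, observations.shape[0])
mtraining = sp.csr_matrix((tdata, tindices, tindptr), shape=observations.shape)

print(mtraining.toarray())                     # each row is the training vector
print((mtraining + observations).toarray())    # row i is T + Oi
```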
I tried to get clever by using stride_tricks to make the CSR matrix's indices and data members reference the repeated data without using extra memory:
training = sp.csr_matrix(training)
mtraining = sp.csr_matrix(observations.shape, dtype=np.int32)
# Zero-stride views: present the same 1-D buffers as N identical rows.
tdata = training.data
vdata = np.lib.stride_tricks.as_strided(tdata, (mtraining.shape[0], tdata.size), (0, tdata.itemsize))
indices = training.indices
vindices = np.lib.stride_tricks.as_strided(indices, (mtraining.shape[0], indices.size), (0, indices.itemsize))
# indptr is a genuine array (one entry per row), but it's comparatively small.
mtraining.indptr = np.arange(0, len(indices)*mtraining.shape[0]+1, len(indices), dtype=np.int32)
mtraining.data = vdata
mtraining.indices = vindices
But this doesn't work, because the strided views mtraining.data and mtraining.indices have the wrong shape (and according to this answer there's no way to make them the right shape). Trying to make them look flat with the .flat iterator fails because it doesn't look enough like an array (e.g. it has no dtype attribute), and the flatten() method ends up making a copy.
Is there any way to get this done?
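For completeness, the blockwise fallback I'd use if nothing better exists (a sketch with made-up toy data; rowwise_sums and block_size are my own names): densify only a few observation rows at a time, so dense broadcasting works on a small slab instead of the whole matrix.

```python
import numpy as np
import scipy.sparse as sp

def rowwise_sums(observations, T, block_size=2):
    """Yield dense blocks of T + Oi, densifying block_size rows at a time."""
    for start in range(0, observations.shape[0], block_size):
        block = observations[start:start + block_size].toarray()
        yield block + T  # dense broadcasting on the small slab only

# Toy data: 3 observation rows, 2 columns.
obs = sp.csr_matrix(np.array([[0., 1.], [2., 0.], [0., 3.]]))
T = np.array([10., 20.])
result = np.vstack(list(rowwise_sums(obs, T)))
print(result)
```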
By the way, your implementation of the tiling is very smart and efficient; I don't think you can get any better than that. I tried feeding csr_matrix a view of the vector reshaped with as_strided to have 500,000 rows, and it took much longer than your approach: I think internally the array is being copied, breaking the magic. Your second approach, as you note, cannot be done with numpy. – Jaime
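Jaime's observation can be reproduced at small scale with a toy example of my own: the as_strided view itself costs no memory, but csr_matrix stores one explicit entry per (row, nonzero) pair, so the sharing is lost in the result.

```python
import numpy as np
import scipy.sparse as sp
from numpy.lib.stride_tricks import as_strided

row = np.array([5., 0., 7.])
# Zero-stride view: 4 "rows" all backed by the same 3 floats, no copy yet.
tiled_view = as_strided(row, shape=(4, 3), strides=(0, row.itemsize))
m = sp.csr_matrix(tiled_view)
# The sparse matrix materializes every (row, nonzero) pair explicitly:
print(m.nnz)  # 8 = 2 nonzeros x 4 rows
```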