I'm trying to implement a function in NumPy/Scipy to compute Jensen-Shannon divergence between a single (training) vector and a large number of other (observation) vectors. The observation vectors are stored in a very large (500,000x65536) Scipy sparse matrix (a dense matrix won't fit into memory).
As part of the algorithm I need to compute T+Oi for each observation vector Oi, where T is the training vector. I wasn't able to find a way to do this using NumPy's usual broadcasting rules, since sparse matrices don't seem to support those (if T is left as a dense array, Scipy tries to make the sparse matrix dense first, which runs out of memory; if I make T into a sparse matrix, T+Oi fails because the shapes are inconsistent).
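A minimal reproduction of the two failure modes, with tiny toy arrays of my own standing in for the real 500,000x65536 data:

```python
import numpy as np
import scipy.sparse as sp

# Tiny stand-ins for the real data (the real matrix won't fit densely).
obs = sp.csr_matrix(np.array([[0., 2., 0., 1.],
                              [3., 0., 0., 0.],
                              [0., 0., 5., 0.]]))
T = np.array([1., 0., 0., 4.])

# Sparse + sparse with mismatched shapes is rejected outright:
try:
    sp.csr_matrix(T) + obs
    broadcasted = True
except ValueError:
    broadcasted = False
print("sparse broadcasting worked:", broadcasted)

# The only path that broadcasts is the dense one, which materializes
# the whole observation matrix in memory first:
dense_sum = obs.toarray() + T  # row i is Oi + T
print(dense_sum)
```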
Currently I am taking the grossly inefficient step of tiling the training vector into a 500,000x65536 sparse matrix:
import numpy as np
import scipy.sparse as sp

# Ensure CSR layout and a compact dtype for the training vector.
training = sp.csr_matrix(training.astype(np.float32))
# Every row holds the same number of nonzeros, so indptr is a uniform ramp.
tindptr = np.arange(0, len(training.indices)*observations.shape[0]+1, len(training.indices), dtype=np.int32)
# Repeat the column indices and values once per observation row.
tindices = np.tile(training.indices, observations.shape[0])
tdata = np.tile(training.data, observations.shape[0])
mtraining = sp.csr_matrix((tdata, tindices, tindptr), shape=observations.shape)
But this takes up a huge amount of memory (around 6GB), when it's only storing ~1500 "real" elements. It's also pretty slow to construct.
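At a small scale the construction can be sanity-checked with toy data of my own: every row of mtraining is a copy of the training vector, so mtraining + observations gives the T+Oi sums row by row:

```python
import numpy as np
import scipy.sparse as sp

# Toy stand-ins (2 observation rows, 3 columns) for the real data.
observations = sp.csr_matrix(np.array([[0., 1., 0.],
                                       [2., 0., 3.]]))
training = sp.csr_matrix(np.array([[5., 0., 7.]]).astype(np.float32))

# Same tiling construction as above, at toy scale.
tindptr = np.arange(0, len(training.indices)*observations.shape[0]+1,
                    len(training.indices), dtype=np.int32)
tindices = np.tile(training.indices, observations.shape[0])
tdata = np.tile(training.data, observations.shape[0])
mtraining = sp.csr_matrix((tdata, tindices, tindptr), shape=observations.shape)

print(mtraining.toarray())                     # each row is the training vector
print((mtraining + observations).toarray())    # row i is T + Oi
```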
I tried to get clever by using stride_tricks to make the CSR matrix's indices and data members reference the repeated data without using extra memory:
training = sp.csr_matrix(training)
mtraining = sp.csr_matrix(observations.shape, dtype=np.int32)
# Zero-stride views: present the same 1-D buffers as N identical rows.
tdata = training.data
vdata = np.lib.stride_tricks.as_strided(tdata, (mtraining.shape[0], tdata.size), (0, tdata.itemsize))
indices = training.indices
vindices = np.lib.stride_tricks.as_strided(indices, (mtraining.shape[0], indices.size), (0, indices.itemsize))
# indptr is a genuine array (one entry per row), but it's comparatively small.
mtraining.indptr = np.arange(0, len(indices)*mtraining.shape[0]+1, len(indices), dtype=np.int32)
mtraining.data = vdata
mtraining.indices = vindices
But this doesn't work, because the strided views mtraining.data and mtraining.indices have the wrong shape (and according to this answer there's no way to make them the right shape). Trying to make them look flat with the .flat iterator fails because it doesn't look enough like an array (e.g. it has no dtype attribute), and the flatten() method ends up making a copy.
Is there any way to get this done?
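For completeness, the blockwise fallback I'd use if nothing better exists (a sketch with made-up toy data; rowwise_sums and block_size are my own names): densify only a few observation rows at a time, so dense broadcasting works on a small slab instead of the whole matrix.

```python
import numpy as np
import scipy.sparse as sp

def rowwise_sums(observations, T, block_size=2):
    """Yield dense blocks of T + Oi, densifying block_size rows at a time."""
    for start in range(0, observations.shape[0], block_size):
        block = observations[start:start + block_size].toarray()
        yield block + T  # dense broadcasting on the small slab only

# Toy data: 3 observation rows, 2 columns.
obs = sp.csr_matrix(np.array([[0., 1.], [2., 0.], [0., 3.]]))
T = np.array([10., 20.])
result = np.vstack(list(rowwise_sums(obs, T)))
print(result)
```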
By the way, your implementation of the tiling is very smart and efficient; I don't think you can get any better than that. I tried feeding csr_matrix a view of the vector reshaped with as_strided to have 500,000 rows, and it took much longer than your approach: I think internally the array is being copied, breaking the magic. Your second approach, as you note, cannot be done with numpy. – Jaime
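Jaime's observation can be reproduced at small scale with a toy example of my own: the as_strided view itself costs no memory, but csr_matrix stores one explicit entry per (row, nonzero) pair, so the sharing is lost in the result.

```python
import numpy as np
import scipy.sparse as sp
from numpy.lib.stride_tricks import as_strided

row = np.array([5., 0., 7.])
# Zero-stride view: 4 "rows" all backed by the same 3 floats, no copy yet.
tiled_view = as_strided(row, shape=(4, 3), strides=(0, row.itemsize))
m = sp.csr_matrix(tiled_view)
# The sparse matrix materializes every (row, nonzero) pair explicitly:
print(m.nnz)  # 8 = 2 nonzeros x 4 rows
```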