4 votes

I've got a SciPy sparse matrix A, let's say in CSR format, and a vector v of matching length.

What's the best way of row-scaling A with v, i.e., performing diag(v) * A?
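For concreteness, a toy illustration of the operation I mean (the variable names are made up for the example); what I'd like to avoid is actually building the dense diagonal matrix:

import numpy as np
from scipy import sparse

A = sparse.csr_matrix(np.array([[1., 2., 0.],
                                [0., 3., 4.],
                                [5., 0., 6.]]))
v = np.array([10., 100., 1000.])

# row i of A multiplied by v[i], i.e. diag(v) * A,
# but ideally without materializing the dense diagonal
expected = np.diag(v).dot(A.toarray())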


2 Answers

3 votes

The easy way is to let scipy handle the gory details, and simply do:

scipy.sparse.spdiags(v, 0, len(v), len(v)) * A
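On a recent enough SciPy, sparse.diags builds the same diagonal matrix with a shorter call, and multiply with broadcasting is yet another spelling; a quick sketch, assuming a square CSR matrix A and a vector v of matching length:

import numpy as np
import scipy.sparse as sps

n = 5
v = np.random.rand(n)
A = sps.rand(n, n, density=0.5, format='csr')

scaled_a = sps.diags(v) * A              # same result as the spdiags line above
scaled_b = A.multiply(v[:, np.newaxis])  # broadcasting; note this comes back as COO

np.testing.assert_almost_equal(scaled_a.toarray(), scaled_b.toarray())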

EDIT If (and only if) your matrix is stored in CSC format, you can do the operation in place as follows:

A_csc.data = A_csc.data * v[A_csc.indices]
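This works because in CSC storage A_csc.indices holds the row index of every stored value, so each nonzero is multiplied by the v entry of its own row. Spelled out as a (slow) loop, the line above does roughly this:

for k in range(len(A_csc.data)):
    row = A_csc.indices[k]   # row of the k-th stored nonzero
    A_csc.data[k] *= v[row]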

I've done some timings, and it wildly depends on the sparsity and size of the matrix; feel free to play with the following code:

import numpy as np
import scipy.sparse as sps
import timeit

A_csr = None
A_csc = None
v = None

def time_row_scaling(n, dens):
    global A_csr, A_csc, v
    v = np.random.rand(n)
    A_csr = sps.rand(n, n, density=dens, format='csr')
    A_csc = A_csr.tocsc()

    def row_scale(A_csc, v):
        # in-place CSC row scaling: indices holds the row of each stored value
        A_csc.data = A_csc.data * v[A_csc.indices]

    row_scaled_1 = sps.spdiags(v, 0, n, n) * A_csr
    row_scaled_2 = sps.spdiags(v, 0, n, n) * A_csc
    row_scale(A_csc, v)
    if n < 1000:
        # sanity check: all three approaches must agree (skipped for big matrices)
        np.testing.assert_almost_equal(row_scaled_1.toarray(),
                                       row_scaled_2.toarray())
        np.testing.assert_almost_equal(row_scaled_1.toarray(),
                                       A_csc.toarray())
    A_csc = A_csr.tocsc()  # restore the unscaled CSC copy before timing
    t1 = timeit.timeit('sps.spdiags(v, 0, len(v), len(v)) * A_csr',
                       'from __main__ import sps, v, A_csr',
                       number=1)
    t2 = timeit.timeit('sps.spdiags(v, 0, len(v), len(v)) * A_csc',
                       'from __main__ import sps, v, A_csc',
                       number=1)
    t3 = timeit.timeit('A_csc.data = A_csc.data * v[A_csc.indices]',
                       'from __main__ import A_csc, v',
                       number=1)
    print(t1, t2, t3)

>>> time_row_scaling(1000, 0.01)
0.00100659830939 0.00102425072673 0.000231944553347
>>> time_row_scaling(1000, 0.1)
0.0017328105481 0.00311339379218 0.00239826562947
>>> time_row_scaling(10000, 0.01)
0.0162369397769 0.0359325217874 0.0216837368279
>>> time_row_scaling(10000, 0.1)
0.167978350747 0.492032396702 0.209231639536

The summary seems to be: if your matrix is in CSR format, or really big, go with the simple first method. If it is a smallish, very sparse matrix, the in-place method will be faster, although all the times are small in that case.

2 votes

sklearn provides utilities to do this in sklearn.utils.sparsefuncs.inplace_csr_row_scale. In my experiments this slightly outperformed both the methods Jaime suggested and the csr_matrix.multiply method. Note that my experiments were on extremely large matrices, with shapes on the order of 10^7 x 10^4. sklearn comes in at about 2 seconds for matrices of this size; the other methods range from 2.5 to 5 seconds.
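For reference, a minimal usage sketch of the sklearn helper (the shape and density here are made up; check the signature against your sklearn version):

import numpy as np
import scipy.sparse as sps
from sklearn.utils.sparsefuncs import inplace_csr_row_scale

X = sps.rand(10000, 200, density=0.01, format='csr')
scale = np.random.rand(X.shape[0])

inplace_csr_row_scale(X, scale)  # modifies X in place: row i is multiplied by scale[i]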

However, I've found that by far the most performant way to accomplish this is by hooking into MKL, using the provided mkl_?csrmultcsr method and a diagonal matrix.

I'm not providing code for doing so, as my wrapper is as yet too buggy, but it performs the rescaling in about 0.3 s for a matrix of the same size as above.

Perhaps someday numpy/scipy will hook into MKL for sparse math just like they do for dense math...