1
votes

Is there a way I can allocate memory for scipy sparse matrix functions to process large datasets?

Specifically, I'm attempting to use Asymmetric Least Squares Smoothing (translated into python here and the original here) to perform a baseline correction on a large mass spec dataset (length of ~60,000).

The function (see below) uses the scipy.sparse matrix operations.

def baseline_als(y, lam, p, niter):
  L = len(y)
  D = sparse.csc_matrix(np.diff(np.eye(L), 2))
  w = np.ones(L)
  for i in xrange(niter):
    W = sparse.spdiags(w, 0, L, L)
    Z = W + lam * D.dot(D.transpose())
    z = spsolve(Z, w*y)
    w = p * (y > z) + (1-p) * (y < z)
  return z

I have no problem when I pass data sets that are 10,000 or less in length:

baseline_als(np.ones(10000),100,0.1,10)

But when passing larger data sets, e.g.

baseline_als(np.ones(50000), 100, 0.1, 10)

I get a MemoryError, for the line

  D = sparse.csc_matrix(np.diff(np.eye(L), 2))
1
No you can't allocate memory for numpy arrays. I suspect you are getting the memory error in the np.eye or np.diff functions. Both will produced (L,L) shape arrays - very big ones. Try those without the sparse call. - hpaulj
@hpaulj You're right. The error arrises when the np.diff function is passed the large np.eye generated array: np.diff(np.eye(len(np.ones(50000)))) I'm new to python. Anyone know a work around? - Todd Duncombe
Usually when people make large sparse matrices, they try to construct them without first making the equivalent dense array. Making the dense one is convenient in small cases, but defeats many of the advantages of using sparse ones. - hpaulj
I've implemented ALS in python as well, and I found that using solve_banded sped things up significantly. I'd also be willing to share my implementation, if you want it. - perimosocordiae
@perimosocordiae I'd appreciate it if you could share the implementation. I plan on analyzing a reasonably large set data, any improvement in speed would be beneficial. - Todd Duncombe

1 Answers

2
votes

Try changing

D = sparse.csc_matrix(np.diff(np.eye(L), 2))

to

diag = np.ones(L - 2)
D = sparse.spdiags([diag, -2*diag, diag], [0, -1, -2], L, L-2)

D will be a sparse matrix in DIAgonal format. If it turns out that being in CSC format is important, convert it using the tocsc() method:

D = sparse.spdiags([diag, -2*diag, diag], [0, -1, -2], L, L-2).tocsc()

The following example shows that the old and new versions generate the same matrix:

In [67]: from scipy import sparse

In [68]: L = 8

Original:

In [69]: D = sparse.csc_matrix(np.diff(np.eye(L), 2))

In [70]: D.A
Out[70]: 
array([[ 1.,  0.,  0.,  0.,  0.,  0.],
       [-2.,  1.,  0.,  0.,  0.,  0.],
       [ 1., -2.,  1.,  0.,  0.,  0.],
       [ 0.,  1., -2.,  1.,  0.,  0.],
       [ 0.,  0.,  1., -2.,  1.,  0.],
       [ 0.,  0.,  0.,  1., -2.,  1.],
       [ 0.,  0.,  0.,  0.,  1., -2.],
       [ 0.,  0.,  0.,  0.,  0.,  1.]])

New version:

In [71]: diag = np.ones(L - 2)

In [72]: D = sparse.spdiags([diag, -2*diag, diag], [0, -1, -2], L, L-2)

In [73]: D.A
Out[73]: 
array([[ 1.,  0.,  0.,  0.,  0.,  0.],
       [-2.,  1.,  0.,  0.,  0.,  0.],
       [ 1., -2.,  1.,  0.,  0.,  0.],
       [ 0.,  1., -2.,  1.,  0.,  0.],
       [ 0.,  0.,  1., -2.,  1.,  0.],
       [ 0.,  0.,  0.,  1., -2.,  1.],
       [ 0.,  0.,  0.,  0.,  1., -2.],
       [ 0.,  0.,  0.,  0.,  0.,  1.]])