12
votes

I have a Python list of row information for a sparse matrix. Each row is represented as a list of (column, value) tuples. Call it alist:

alist = [[(1,10), (3,-3)],
         [(2,12)]]

How can I efficiently construct a scipy sparse matrix from this list of lists, resulting in a matrix like this:

0  10   0  -3
0   0  12   0

The obvious approach is to make a scipy.sparse.lil_matrix, which internally has this "list of lists" structure. But from the scipy.sparse.lil_matrix — SciPy v0.19.0 Reference Guide I see just three ways of constructing them:

  • starting from a dense array
  • starting from another sparse array
  • just constructing an empty array

So the only way to get fresh data in is either to solve this problem with some other sparse matrix representation, or to start with a dense array, neither of which address the initial problem, and both of which seem likely to be less efficient representations than lil_matrix itself for this data.

I guess I can make an empty one, and use a loop to add values, but surely I'm missing something.

The scipy documentation is really frustrating when it comes to sparse matrices.

4

4 Answers

8
votes

Your data layout is an unusual one. Here's my first stab at using it.

In [565]: M = sparse.lil_matrix((2,4), dtype=int)
In [566]: M
Out[566]: 
<2x4 sparse matrix of type '<class 'numpy.int32'>'
    with 0 stored elements in LInked List format>
In [567]: for i,row in enumerate(alist):
     ...:     for col in row:
     ...:         M[i, col[0]] = col[1]
     ...:         
In [568]: M
Out[568]: 
<2x4 sparse matrix of type '<class 'numpy.int32'>'
    with 3 stored elements in LInked List format>
In [569]: M.A
Out[569]: 
array([[ 0, 10,  0, -3],
       [ 0,  0, 12,  0]])

Yes, it is iterative; and lil is the best format for that purpose.

Or using the common coo style of inputs:

In [580]: data,col,row = [],[],[]
In [581]: for i, rr in enumerate(alist):
     ...:     for cc in rr:
     ...:         row.append(i)
     ...:         col.append(cc[0])
     ...:         data.append(cc[1])
     ...:         
In [582]: data,col,row
Out[582]: ([10, -3, 12], [1, 3, 2], [0, 0, 1])
In [583]: M1=sparse.coo_matrix((data,(row,col)),shape=(2,4))
In [584]: M1
Out[584]: 
<2x4 sparse matrix of type '<class 'numpy.int32'>'
    with 3 stored elements in COOrdinate format>
In [585]: M1.A
Out[585]: 
array([[ 0, 10,  0, -3],
       [ 0,  0, 12,  0]])

Another option is to create the blank lil matrix, and directly fill in its attributes:

In other words, start with:

In [591]: m.data
Out[591]: array([[], []], dtype=object)
In [592]: m.rows
Out[592]: array([[], []], dtype=object)

and change them to:

In [587]: M.data
Out[587]: array([[10, -3], [12]], dtype=object)
In [588]: M.rows
Out[588]: array([[1, 3], [2]], dtype=object)

It would still require the 2 level iteration on your alist structure.

In [593]: for i, rr in enumerate(alist):
     ...:     for cc in rr:
     ...:         m.rows[i].append(cc[0])
     ...:         m.data[i].append(cc[1])       
In [594]: m
Out[594]: 
<2x4 sparse matrix of type '<class 'numpy.int32'>'
    with 3 stored elements in LInked List format>
In [595]: m.A
Out[595]: 
array([[ 0, 10,  0, -3],
       [ 0,  0, 12,  0]])

In another comment you mentioned the difficulty in understanding the csr indptr. The easiest way to get that is to convert one these formats:

In [597]: Mr=M.tocsr()
In [598]: Mr.indptr
Out[598]: array([0, 2, 3], dtype=int32)
In [599]: Mr.data
Out[599]: array([10, -3, 12])
In [600]: Mr.indices
Out[600]: array([1, 3, 2], dtype=int32)
8
votes

If you have the whole alist before creating the sparse matrix, there is no need to use lil_matrix, since that is optimized for incrementally updating the sparse matrix.

If you want to do any sort of arithmetic with the matrix afterwords, csr_matrix is probably your best choice. You can construct the csr_matrix directly by using (data, (row, column)) format, like this:

In [40]: alist = [[(1,10), (3,-3)],
    ...:          [(2,12)]]

In [41]: i, j, data = zip(*((i, t[0], t[1]) for i, row in enumerate(alist) for t in row))

In [42]: (i, j, data)
Out[42]: ((0, 0, 1), (1, 3, 2), (10, -3, 12))

In [43]: csr_matrix((data, (i, j)), shape=(2, 4)).todense()
Out[43]: 
matrix([[ 0, 10,  0, -3],
        [ 0,  0, 12,  0]], dtype=int64)

If efficiency is a real concern, you can create the csr_matrix internal format (using indptr) directly:

In [57]: indptr = np.cumsum([0] + [len(row) for row in alist])

In [58]: j, data = zip(*(t for row in alist for t in row))

In [59]: csr_matrix((data, j, indptr), shape=(2, 4)).todense()
Out[59]: 
matrix([[ 0, 10,  0, -3],
        [ 0,  0, 12,  0]])

If you're converting to pandas afterwords, coo_matrix is the way to go, since pandas converts to coo_matrix anyway:

In [41]: i, j, data = zip(*((i, t[0], t[1]) for i, row in enumerate(alist) for t in row))

In [43]: coo_matrix((data, (i, j)), shape=(2, 4))
4
votes

You can create a dict of positions and values from the list of (column, value) tuples alist and then use dok_matrix to construct the sparse matrix

>>> d = {(i,j):v for i,l in enumerate(alist) for j,v in l}
>>> d
{(0, 1): 10, (0, 3): -3, (1, 2): 12}
>>> 
>>> from operator import itemgetter
>>> m = max(d.keys(), key=itemgetter(0))[0] + 1
>>> n = max(d.keys(), key=itemgetter(1))[1] + 1
>>> m,n
(2, 4)
>>>
>>> from scipy.sparse import dok_matrix
>>> S = dok_matrix((m,n), dtype=int)
>>> for pos,v in d.items():
...     S[pos] = v
... 
>>> S.todense()
matrix([[ 0, 10,  0, -3],
        [ 0,  0, 12,  0]])
>>> 
3
votes

Just wanted to post another answer using coo_matrix, It is a fast format for constructing sparse matrices.

>>> alist = [[(1, 10), (3, -3)], [(2, 12)]]
>>> row, col, data = zip(*((i,j,v) for i,l in enumerate(alist) for j,v in l))
>>>
>>> from scipy.sparse import coo_matrix
>>> S = coo_matrix((data, (row, col)), (max(row)+1,max(col)+1), dtype=np.int8)
>>> S.todense()
matrix([[ 0, 10,  0, -3],
        [ 0,  0, 12,  0]], dtype=int8)
>>>