0
votes

I have several sparse vectors represented as lists of tuples, e.g.

[[(22357, 0.6265631775164965),
  (31265, 0.3900572375543419),
  (44744, 0.4075397480094991),
  (47751, 0.5377595092643747)],
 [(22354, 0.6265631775164965),
  (31261, 0.3900572375543419),
  (42344, 0.4075397480094991),
  (47751, 0.5377595092643747)],
...
]

My goal is to build a scipy.sparse.csr_matrix from several million vectors like this.

Is there a simple, elegant way to do this conversion without holding everything in memory at once?

EDIT: Just a clarification: my goal is to build a 2D matrix where each of my sparse vectors represents one row.

2
So is the goal to create one long vector or a 2D matrix? Are the vectors all of the same length coincidentally? – user2379410
The goal is to create a 2D matrix where each of my sparse vectors will represent one row. – ziky90

2 Answers

3
votes

Collecting the indices and data into a structured array avoids the integer-to-double conversion issue. It is also a bit faster than the vstack approach in limited testing. (With list data like this, np.array is faster than np.vstack.)

indptr = np.cumsum([0]+[len(i) for i in vectors])
aa = np.array(vectors,dtype='i,f').flatten()
A = sparse.csr_matrix((aa['f1'], aa['f0'], indptr))

I substituted a list comprehension for map since I'm using Python 3.
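As a quick sanity check, here is the same construction run end to end on toy equal-length rows (the data is mine, not the question's):

```python
import numpy as np
from scipy import sparse

# Toy stand-in for the question's data: equal-length rows of (col, value) tuples.
vectors = [[(0, 1.0), (3, 2.0)],
           [(1, 3.0), (2, 4.0)]]

indptr = np.cumsum([0] + [len(v) for v in vectors])
aa = np.array(vectors, dtype='i,f').flatten()   # 'f0' = int column, 'f1' = float value
A = sparse.csr_matrix((aa['f1'], aa['f0'], indptr))

print(A.toarray())
# [[1. 0. 0. 2.]
#  [0. 3. 4. 0.]]
```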

Indices in the coo format, (data, (i, j)), might be more intuitive:

ii = np.array([[i]*len(v) for i,v in enumerate(vectors)]).flatten()
aa = np.array(vectors, dtype='i,f').flatten()
A2 = sparse.coo_matrix((aa['f1'], (ii, aa['f0'])))
# A2.tocsr()

Here the nested list used to build ii holds the row number for each entry of each sublist:

[[0, 0, 0, 0],
 [1, 1, 1, 1],
 [2, 2, 2, 2],
 [3, 3, 3, 3],
 ...]

This construction method is slower than building the csr matrix directly with indptr.
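The relative speed of the two constructions can be checked on synthetic data; a rough sketch (the data shape, sizes, and repeat count are my assumptions, not measurements from the answer):

```python
import timeit
import numpy as np
from scipy import sparse

# Synthetic equal-length rows: 2000 rows of 4 distinct (col, value) tuples each.
rng = np.random.default_rng(0)
vectors = [[(int(c), float(v))
            for c, v in zip(rng.choice(1000, 4, replace=False), rng.random(4))]
           for _ in range(2000)]

def build_csr():
    # direct csr construction via indptr
    indptr = np.cumsum([0] + [len(v) for v in vectors])
    aa = np.array(vectors, dtype='i,f').flatten()
    return sparse.csr_matrix((aa['f1'], aa['f0'], indptr))

def build_coo():
    # coo construction with explicit row numbers, then conversion
    ii = np.array([[i] * len(v) for i, v in enumerate(vectors)]).flatten()
    aa = np.array(vectors, dtype='i,f').flatten()
    return sparse.coo_matrix((aa['f1'], (ii, aa['f0']))).tocsr()

t_csr = timeit.timeit(build_csr, number=20)
t_coo = timeit.timeit(build_coo, number=20)
print(f"csr direct: {t_csr:.3f}s   coo: {t_coo:.3f}s")
```

Both routes produce the same matrix; only the construction cost differs.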


For a case where there are differing numbers of entries per row, this approach works (using itertools.chain to flatten the lists):

A sample list (no empty rows for now):

In [779]: vectors=[[(1, .12),(3, .234),(6,1.23)],
                   [(2,.222)],
                   [(2,.23),(1,.34)]]

Row indices:

In [780]: ii=[[i]*len(v) for i,v in enumerate(vectors)]
In [781]: ii=list(chain(*ii))

Column indices and data values, pulled from the tuples and flattened:

In [782]: jj=[j for j,_ in chain(*vectors)]    
In [783]: data=[d for _,d in chain(*vectors)]

In [784]: ii
Out[784]: [0, 0, 0, 1, 2, 2]

In [785]: jj
Out[785]: [1, 3, 6, 2, 2, 1]

In [786]: data
Out[786]: [0.12, 0.234, 1.23, 0.222, 0.23, 0.34]

In [787]: A=sparse.csr_matrix((data,(ii,jj)))  # coo style input

In [788]: A.A
Out[788]: 
array([[ 0.   ,  0.12 ,  0.   ,  0.234,  0.   ,  0.   ,  1.23 ],
       [ 0.   ,  0.   ,  0.222,  0.   ,  0.   ,  0.   ,  0.   ],
       [ 0.   ,  0.34 ,  0.23 ,  0.   ,  0.   ,  0.   ,  0.   ]])
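Since the question mentions millions of rows, the intermediate Python lists (ii, jj, data) can themselves get large. A sketch of a more memory-conscious variant of the same idea, preallocating flat NumPy arrays in two passes (rows_to_csr is a hypothetical helper, not part of the answer; it assumes at least one nonzero entry overall):

```python
import numpy as np
from scipy import sparse

def rows_to_csr(vectors):
    """Two-pass csr build with preallocated arrays instead of Python lists.

    Assumes `vectors` is a re-iterable list of rows, each a list of
    (column, value) tuples.
    """
    nnz = sum(len(row) for row in vectors)           # pass 1: count entries
    indptr = np.zeros(len(vectors) + 1, dtype=np.int64)
    indices = np.empty(nnz, dtype=np.int64)
    data = np.empty(nnz, dtype=np.float64)
    k = 0
    for r, row in enumerate(vectors):                # pass 2: fill in place
        for j, d in row:
            indices[k] = j
            data[k] = d
            k += 1
        indptr[r + 1] = k
    shape = (len(vectors), int(indices.max()) + 1)
    return sparse.csr_matrix((data, indices, indptr), shape=shape)

A = rows_to_csr([[(1, .12), (3, .234), (6, 1.23)],
                 [(2, .222)],
                 [(2, .23), (1, .34)]])
```

This keeps peak memory close to the final matrix's own storage, at the cost of iterating the input twice.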
1
votes

Consider the following:

import numpy as np
from scipy.sparse import csr_matrix

vectors = [[(22357, 0.6265631775164965),
            (31265, 0.3900572375543419),
            (44744, 0.4075397480094991),
            (47751, 0.5377595092643747)],
           [(22354, 0.6265631775164965),
            (31261, 0.3900572375543419),
            (42344, 0.4075397480094991),
            (47751, 0.5377595092643747)]]

indptr = np.cumsum([0] + list(map(len, vectors)))
indices, data = np.vstack(vectors).T
A = csr_matrix((data, indices.astype(int), indptr))

Unfortunately, np.vstack converts the column indices from integers to doubles, so they have to be converted back. The result is still correct even for very large matrices, but the round-trip is not ideal.
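The integer-to-double round-trip is lossless for indices up to 2**53 (the width of the float64 significand), which is why this still works for very large matrices. A quick check (my addition, not from the answer):

```python
import numpy as np

# float64 has a 53-bit significand, so integers up to 2**53 survive the
# int -> double -> int round-trip exactly; column indices are far smaller.
assert int(np.float64(2**53)) == 2**53
assert int(np.float64(2**53 - 1)) == 2**53 - 1

# beyond 2**53, adjacent integers collapse to the same double:
assert np.float64(2**53 + 1) == np.float64(2**53)
```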