0
votes

I have a large sparse matrix (using scipy.sparse) with I rows and U columns, U is much greater than I. I have a list of U random numbers in the range of 0:I. I would like to create a new sparse matrix which will be a U * U sparse matrix, the row for user u will hold all the U values in row i of the original sparse matrix. For example, if the original matrix is a 3*5 matrix:

0,0,2,1,0
0,0,3,4,1
1,1,0,2,0

and the list of random numbers is [0,0,2,1,2]

The resulting matrix should be:

0,0,2,1,0
0,0,2,1,0
1,1,0,2,0
0,0,3,4,1
1,1,0,2,0

I am using this code now, which is very very slow:

for u in range(U):
    i= random_indices[u]
    if u == 0:
        output_sparse_matrix = original_sparse_matrix[i, :]
    else:
        output_sparse_matrix = vstack((output_sparse_matrix,
                                       original_sparse_matrix[i, :]))

Any suggestions on how this can be done quicker?

Update I used Jérôme Richard's suggestion, but inside a loop - since I got an out of memory error. This is the solution that worked:

bins = np.array_split(random_indices, 10)
output_sparse_matrix = original_sparse_matrix[bins[0]]

for bin in bins[1:10]:
   output_sparse_matrix = vstack((output_sparse_matrix ,original_sparse_matrix[bin]))
2
Are you using numpy? or scipy.sparse?hpaulj
I'm using scipy.sparseRoee Anuar
Even with dense arrays, repeatedly doing concatenate in a loop is slow. There we recommend collecting the arrays in a list and doing one join at the end. I haven't played with sparse.vstack much, but I think the same applies sparse.vstack combines the coo attributes of its arguments, and uses them to make a new coo matrix.hpaulj
Look at the code for sparse.vstack. It delegates the task to sparse.bmat, bmat([[b] for b in blocks]).hpaulj

2 Answers

1
votes

vstack create a new matrix for every iteration. This is the main source of slowdown since the complexity of the algorithm is O(U^3). You can just append the new lines in a Python list and then vstack the list of lines. Alternatively, a better approach is just to use the following Numpy expression :

original_sparse_matrix[random_indices, :]

1
votes

This may no be faster but you can try using fancy indexing:

output_sparse_matrix = input_sparse_matrix[random_indices]

provided random_indices is a list the above should give the desired result.

Applying this to your original example:

from scipy.sparse import csr_matrix

a = csr_matrix([[0,0,2,1,0],
[0,0,3,4,1],
[1,1,0,2,0]])

indices =  [0,0,2,1,2]


output_matrix = a[indices]

print(output_matrix.todense())