A compressed sparse column (csc) matrix with m columns, as defined here, has three arrays: data, indices, and indptr.
- indptr is a 1D array of size m + 1.
- indices holds the row indices of all non-zero entries of the matrix.
- indices[indptr[i]:indptr[i + 1]] gives the row indices of the non-zero entries of the i-th column; the values of these entries are stored in data[indptr[i]:indptr[i + 1]].
A compressed sparse row (csr) matrix with n rows is defined analogously.
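For concreteness, here is a tiny (made-up) example showing the three arrays of a csc matrix and the slicing from the definition above:

    import numpy as np
    import scipy.sparse as sp

    # 3 x 3 matrix:
    # [[1, 0, 4],
    #  [0, 0, 5],
    #  [2, 3, 6]]
    dense = np.array([[1, 0, 4],
                      [0, 0, 5],
                      [2, 3, 6]])
    csc = sp.csc_matrix(dense)

    print(csc.data)     # [1 2 3 4 5 6]  values, column by column
    print(csc.indices)  # [0 2 2 0 1 2]  row index of each value
    print(csc.indptr)   # [0 2 3 6]      column i occupies data[indptr[i]:indptr[i+1]]

    # Row indices and values of the non-zero entries of column 2:
    i = 2
    print(csc.indices[csc.indptr[i]:csc.indptr[i + 1]])  # [0 1 2]
    print(csc.data[csc.indptr[i]:csc.indptr[i + 1]])     # [4 5 6]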
Problem:
CPUs work on n jobs. The p-th job produces the data for the q-th row of a sparse matrix (csc or csr) with n rows and m columns. p and q need not be equal, as long as it is recorded which job corresponds to which row.
How to efficiently collect the data from the jobs and produce a csc matrix?
Current solutions:
Approach A.
- Empty arrays shared_data, shared_indices, and shared_indptr (together they define a csr matrix), plus shared_ordering, are created. The sizes of shared_data and shared_indices depend on an estimate of the number of non-zero entries. Set shared_indptr[0] = 0. These arrays are shared by the jobs.
- Counters for the number of non-zero entries written so far (counter_nz) and the number of rows entered into the csr matrix (counter_row) are also shared by the jobs.
- Accessing the shared arrays and counters above requires acquiring a lock.
- Each job knows the column indices of the non-zero entries (job_indices) it wants to put into the csr matrix, their values (job_data), and the number of non-zero entries (job_nz).
- Each job tries to acquire the lock.
- Upon acquiring the lock, a job does the following, then releases the lock:
    shared_data[counter_nz:(counter_nz + job_nz)] = job_data
    shared_indices[counter_nz:(counter_nz + job_nz)] = job_indices
    shared_indptr[counter_row + 1] = shared_indptr[counter_row] + job_nz
    shared_ordering[counter_row] = job_id
    counter_nz += job_nz
    counter_row += 1
    release lock
- After all jobs finish, create a csr matrix object by wrapping the shared arrays, e.g. scipy.sparse.csr_matrix((shared_data, shared_indices, shared_indptr)).
- Convert the csr matrix to a csc matrix (the goal is to produce a csc matrix, not a csr matrix).
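For reference, a minimal sketch of approach A, assuming multiprocessing with a fork start method so workers inherit the shared ctypes buffers; the buffer sizes, dtypes, and the function names write_job_result / build_csc are placeholders, not part of the original code:

    import ctypes
    import numpy as np
    import scipy.sparse as sp
    from multiprocessing import Lock
    from multiprocessing.sharedctypes import RawArray, RawValue

    NNZ_ESTIMATE = 1_000_000          # guessed upper bound on total non-zeros
    N_ROWS, N_COLS = 10_000, 500      # n rows, m columns (placeholder sizes)

    # Shared buffers; RawArray is zero-initialized, so shared_indptr[0] = 0 already.
    shared_data = RawArray(ctypes.c_double, NNZ_ESTIMATE)
    shared_indices = RawArray(ctypes.c_int64, NNZ_ESTIMATE)
    shared_indptr = RawArray(ctypes.c_int64, N_ROWS + 1)
    shared_ordering = RawArray(ctypes.c_int64, N_ROWS)
    counter_nz = RawValue(ctypes.c_int64, 0)
    counter_row = RawValue(ctypes.c_int64, 0)
    lock = Lock()

    def write_job_result(job_id, job_indices, job_data):
        """Append one job's row to the shared csr arrays while holding the lock."""
        job_nz = len(job_data)
        data = np.frombuffer(shared_data)                        # float64 view
        indices = np.frombuffer(shared_indices, dtype=np.int64)
        indptr = np.frombuffer(shared_indptr, dtype=np.int64)
        ordering = np.frombuffer(shared_ordering, dtype=np.int64)
        with lock:
            start = counter_nz.value
            row = counter_row.value
            data[start:start + job_nz] = job_data
            indices[start:start + job_nz] = job_indices
            indptr[row + 1] = indptr[row] + job_nz
            ordering[row] = job_id
            counter_nz.value += job_nz
            counter_row.value += 1

    def build_csc():
        """Wrap the shared arrays as a csr matrix, then convert to csc."""
        nnz = counter_nz.value
        data = np.frombuffer(shared_data)[:nnz]
        indices = np.frombuffer(shared_indices, dtype=np.int64)[:nnz]
        indptr = np.frombuffer(shared_indptr, dtype=np.int64)
        csr = sp.csr_matrix((data, indices, indptr), shape=(N_ROWS, N_COLS))
        return csr.tocsc()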
The problem with this approach is that the lock really kills performance. It is like a dozen people trying to grab the same tool to work on different parts of a project.
Approach B
Each cpu works on m jobs and builds a csc matrix with the method described in A. This uses much more memory, but the matrices are generated more quickly.
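A minimal sketch of approach B, under the reading that each cpu builds one complete matrix by itself (so different cpus work on different matrices in parallel, with no lock and no sharing); compute_job, build_one_matrix, the sizes, and the use of Python lists instead of preallocated buffers are all assumptions made for illustration:

    import numpy as np
    import scipy.sparse as sp
    from multiprocessing import Pool

    N_ROWS, N_COLS = 10_000, 500      # placeholder sizes

    def compute_job(job_id):
        """Placeholder for the real per-job computation; returns (column_indices, values)."""
        return np.array([job_id % N_COLS]), np.array([1.0])

    def build_one_matrix(task_id):
        """One worker builds a whole matrix on its own (A's method, but private arrays, no lock)."""
        data, indices, indptr, ordering = [], [], [0], []
        for job_id in range(N_ROWS):                 # all jobs of this matrix, one row per job
            job_indices, job_data = compute_job(job_id)
            data.extend(job_data)
            indices.extend(job_indices)
            indptr.append(indptr[-1] + len(job_data))
            ordering.append(job_id)
        csr = sp.csr_matrix((data, indices, indptr), shape=(N_ROWS, N_COLS))
        return csr.tocsc(), ordering

    if __name__ == '__main__':
        # s workers each produce whole matrices; memory use is s full matrices at once.
        with Pool(12) as pool:
            results = pool.map(build_one_matrix, range(24))   # e.g. 24 matrices to build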
Approach C
One cpu works on the m jobs and builds a csc matrix. This takes roughly twice as long as approach A on a 12-cpu machine.
Approach D (Update 06/05/2016)
- There are s cpu cores. The shared_data, shared_indptr, and shared_indices arrays of the final csc matrix, plus the shared_ordering array, are shared by all cpu cores. No lock is needed.
- Each cpu core runs a single subprocess (instead of starting and stopping subprocesses all the time).
- Each cpu core processes 1/s of the m jobs and constructs a small csc matrix with the method described in A.
- After a cpu core finishes its jobs, it broadcasts to the other cores the sizes of the data it put into each of the three arrays of its small csc matrix.
- Once a core receives this information from all the other cores (there are s - 1 messages to receive), it calculates where its three small arrays belong within the three arrays of the final csc matrix, then copies its small csc matrix into the shared csc matrix.
- After that, the core sends a signal to the other cores. Once a core receives s - 1 signals, it starts the next cycle. (This second synchronization is probably not needed.)
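A minimal sketch of the copy step in approach D, assuming each worker already holds its small csc arrays as numpy arrays, that the per-worker sizes have already been exchanged, and that workers own consecutive column ranges in rank order; the broadcast/signal mechanics are omitted and all names are placeholders:

    import numpy as np

    def copy_block_into_shared(rank, nnz_sizes, col_counts,
                               small_data, small_indices, small_indptr,
                               shared_data, shared_indices, shared_indptr):
        """Copy worker `rank`'s small csc arrays into the shared arrays of the final csc matrix.

        nnz_sizes[k]  : number of non-zeros produced by worker k (the sizes broadcast earlier)
        col_counts[k] : number of columns produced by worker k
        """
        nnz_offset = int(np.sum(nnz_sizes[:rank]))   # where this worker's data/indices start
        col_offset = int(np.sum(col_counts[:rank]))  # where this worker's columns start
        my_nnz = int(nnz_sizes[rank])
        my_cols = int(col_counts[rank])

        shared_data[nnz_offset:nnz_offset + my_nnz] = small_data[:my_nnz]
        shared_indices[nnz_offset:nnz_offset + my_nnz] = small_indices[:my_nnz]
        # Each small indptr starts at 0, so shift it by this worker's nnz offset;
        # shared_indptr[0] stays 0. The shared_ordering array would be filled the
        # same way, offset by col_offset.
        shared_indptr[col_offset + 1:col_offset + 1 + my_cols] = (
            small_indptr[1:my_cols + 1] + nnz_offset)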
Hopefully the synchronization wastes much less time than the lock in A, and this approach scales linearly from 1 cpu core to 12 cpu cores. If anyone has experience with this, feel free to give suggestions.
A drawback of approach D is that it uses twice the memory required to hold the final csc matrix: half for the shared arrays, half for the arrays of all the small csc matrices... This seems OK. In approach B, each cpu only has about 1/s of the whole memory to use, so approach D is better.
The arrays of the small matrices can be kept as buffers, so no time is wasted recreating numpy arrays.
I am trying to work with B and optimize each job's memory usage per cpu core, but if approach A could avoid killing performance with a smarter lock design, I would use A... I guess I will go with D. Related question: Efficiently populate SciPy sparse matrix from subset of dictionary