A compressed sparse column (csc) matrix with m columns, as defined here, has three arrays: data, indices, and indptr.
- indptr is a 1D array of size m + 1.
- indices holds the row indices of all non-zero entries of the matrix.
- indices[indptr[i]:indptr[i + 1]] gives the row indices of the non-zero entries of the i-th column; the values of these entries are stored in data[indptr[i]:indptr[i + 1]].
A compressed sparse row (csr) matrix with n rows is defined analogously.
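For concreteness, here is a tiny (made-up) example showing the three arrays of a csc matrix and the slicing from the definition above:

    import numpy as np
    import scipy.sparse as sp

    # 3 x 3 matrix:
    # [[1, 0, 4],
    #  [0, 0, 5],
    #  [2, 3, 6]]
    dense = np.array([[1, 0, 4],
                      [0, 0, 5],
                      [2, 3, 6]])
    csc = sp.csc_matrix(dense)

    print(csc.data)     # [1 2 3 4 5 6]  values, column by column
    print(csc.indices)  # [0 2 2 0 1 2]  row index of each value
    print(csc.indptr)   # [0 2 3 6]      column i occupies data[indptr[i]:indptr[i+1]]

    # Row indices and values of the non-zero entries of column 2:
    i = 2
    print(csc.indices[csc.indptr[i]:csc.indptr[i + 1]])  # [0 1 2]
    print(csc.data[csc.indptr[i]:csc.indptr[i + 1]])     # [4 5 6]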
Problem:
CPUs work on n jobs. The p-th job produces the data for the q-th row of a sparse matrix (csc or csr) with n rows and m columns. p and q need not be equal, as long as it is recorded which job corresponds to which row.
How to efficiently collect the data from the jobs and produce a csc matrix?
Current solutions:
Approach A.
- Empty arrays shared_data, shared_indices, and shared_indptr (together they define a csr matrix), plus shared_ordering, are created. The sizes of shared_data and shared_indices depend on an estimate of the number of non-zero entries. Set shared_indptr[0] = 0. These arrays are shared by the jobs.
- Counters for the number of non-zero entries written so far (counter_nz) and the number of rows entered into the csr matrix (counter_row) are also shared by the jobs.
- Accessing the shared arrays and counters above requires acquiring a lock.
- Each job knows the column indices of the non-zero entries (job_indices) it wants to put into the csr matrix, their values (job_data), and the number of non-zero entries (job_nz).
- Each job tries to acquire the lock.
- Upon acquiring the lock, a job does the following, then releases the lock:
    shared_data[counter_nz:(counter_nz + job_nz)] = job_data
    shared_indices[counter_nz:(counter_nz + job_nz)] = job_indices
    shared_indptr[counter_row + 1] = shared_indptr[counter_row] + job_nz
    shared_ordering[counter_row] = job_id
    counter_nz += job_nz
    counter_row += 1
    release lock
- After all jobs finish, create a csr matrix object by wrapping the shared arrays, e.g. scipy.sparse.csr_matrix((shared_data, shared_indices, shared_indptr)).
- Convert the csr matrix to a csc matrix (the goal is to produce a csc matrix, not a csr matrix).
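For reference, a minimal sketch of approach A, assuming multiprocessing with a fork start method so workers inherit the shared ctypes buffers; the buffer sizes, dtypes, and the function names write_job_result / build_csc are placeholders, not part of the original code:

    import ctypes
    import numpy as np
    import scipy.sparse as sp
    from multiprocessing import Lock
    from multiprocessing.sharedctypes import RawArray, RawValue

    NNZ_ESTIMATE = 1_000_000          # guessed upper bound on total non-zeros
    N_ROWS, N_COLS = 10_000, 500      # n rows, m columns (placeholder sizes)

    # Shared buffers; RawArray is zero-initialized, so shared_indptr[0] = 0 already.
    shared_data = RawArray(ctypes.c_double, NNZ_ESTIMATE)
    shared_indices = RawArray(ctypes.c_int64, NNZ_ESTIMATE)
    shared_indptr = RawArray(ctypes.c_int64, N_ROWS + 1)
    shared_ordering = RawArray(ctypes.c_int64, N_ROWS)
    counter_nz = RawValue(ctypes.c_int64, 0)
    counter_row = RawValue(ctypes.c_int64, 0)
    lock = Lock()

    def write_job_result(job_id, job_indices, job_data):
        """Append one job's row to the shared csr arrays while holding the lock."""
        job_nz = len(job_data)
        data = np.frombuffer(shared_data)                        # float64 view
        indices = np.frombuffer(shared_indices, dtype=np.int64)
        indptr = np.frombuffer(shared_indptr, dtype=np.int64)
        ordering = np.frombuffer(shared_ordering, dtype=np.int64)
        with lock:
            start = counter_nz.value
            row = counter_row.value
            data[start:start + job_nz] = job_data
            indices[start:start + job_nz] = job_indices
            indptr[row + 1] = indptr[row] + job_nz
            ordering[row] = job_id
            counter_nz.value += job_nz
            counter_row.value += 1

    def build_csc():
        """Wrap the shared arrays as a csr matrix, then convert to csc."""
        nnz = counter_nz.value
        data = np.frombuffer(shared_data)[:nnz]
        indices = np.frombuffer(shared_indices, dtype=np.int64)[:nnz]
        indptr = np.frombuffer(shared_indptr, dtype=np.int64)
        csr = sp.csr_matrix((data, indices, indptr), shape=(N_ROWS, N_COLS))
        return csr.tocsc()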
The problem with this approach is that the lock really kills performance. It is like a dozen people trying to grab the same tool to work on different parts of a project.
Approach B
Each cpu works on m jobs and builds a csc matrix with the method described in A. This uses much more memory, but the matrices are generated more quickly.
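A minimal sketch of approach B, under the reading that each cpu builds one complete matrix by itself (so different cpus work on different matrices in parallel, with no lock and no sharing); compute_job, build_one_matrix, the sizes, and the use of Python lists instead of preallocated buffers are all assumptions made for illustration:

    import numpy as np
    import scipy.sparse as sp
    from multiprocessing import Pool

    N_ROWS, N_COLS = 10_000, 500      # placeholder sizes

    def compute_job(job_id):
        """Placeholder for the real per-job computation; returns (column_indices, values)."""
        return np.array([job_id % N_COLS]), np.array([1.0])

    def build_one_matrix(task_id):
        """One worker builds a whole matrix on its own (A's method, but private arrays, no lock)."""
        data, indices, indptr, ordering = [], [], [0], []
        for job_id in range(N_ROWS):                 # all jobs of this matrix, one row per job
            job_indices, job_data = compute_job(job_id)
            data.extend(job_data)
            indices.extend(job_indices)
            indptr.append(indptr[-1] + len(job_data))
            ordering.append(job_id)
        csr = sp.csr_matrix((data, indices, indptr), shape=(N_ROWS, N_COLS))
        return csr.tocsc(), ordering

    if __name__ == '__main__':
        # s workers each produce whole matrices; memory use is s full matrices at once.
        with Pool(12) as pool:
            results = pool.map(build_one_matrix, range(24))   # e.g. 24 matrices to build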
Approach C
One cpu works on the m jobs and builds a csc matrix. This takes roughly twice as long as approach A on a 12-cpu machine.
Approach D (Update 06/05/2016)
- There are s cpu cores. The shared_data, shared_indptr, and shared_indices arrays of the final csc matrix, plus the shared_ordering array, are shared by all cpu cores. No lock is needed.
- Each cpu core runs a single subprocess (instead of starting and stopping subprocesses all the time).
- Each cpu core processes 1/s of the m jobs and constructs a small csc matrix with the method described in A.
- After a cpu core finishes its jobs, it broadcasts to the other cores the sizes of the data it put into each of the three arrays of its small csc matrix.
- Once a core receives this information from all the other cores (there are s - 1 messages to receive), it calculates where its three small arrays belong within the three arrays of the final csc matrix, then copies its small csc matrix into the shared csc matrix.
- After that, the core sends a signal to the other cores. Once a core receives s - 1 signals, it starts the next cycle. (This second synchronization is probably not needed.)
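A minimal sketch of the copy step in approach D, assuming each worker already holds its small csc arrays as numpy arrays, that the per-worker sizes have already been exchanged, and that workers own consecutive column ranges in rank order; the broadcast/signal mechanics are omitted and all names are placeholders:

    import numpy as np

    def copy_block_into_shared(rank, nnz_sizes, col_counts,
                               small_data, small_indices, small_indptr,
                               shared_data, shared_indices, shared_indptr):
        """Copy worker `rank`'s small csc arrays into the shared arrays of the final csc matrix.

        nnz_sizes[k]  : number of non-zeros produced by worker k (the sizes broadcast earlier)
        col_counts[k] : number of columns produced by worker k
        """
        nnz_offset = int(np.sum(nnz_sizes[:rank]))   # where this worker's data/indices start
        col_offset = int(np.sum(col_counts[:rank]))  # where this worker's columns start
        my_nnz = int(nnz_sizes[rank])
        my_cols = int(col_counts[rank])

        shared_data[nnz_offset:nnz_offset + my_nnz] = small_data[:my_nnz]
        shared_indices[nnz_offset:nnz_offset + my_nnz] = small_indices[:my_nnz]
        # Each small indptr starts at 0, so shift it by this worker's nnz offset;
        # shared_indptr[0] stays 0. The shared_ordering array would be filled the
        # same way, offset by col_offset.
        shared_indptr[col_offset + 1:col_offset + 1 + my_cols] = (
            small_indptr[1:my_cols + 1] + nnz_offset)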
Hopefully the synchronization wastes much less time than the lock in A, and this approach scales linearly from 1 cpu core to 12 cpu cores. If anyone has experience with this, feel free to give suggestions.
A drawback of approach D is that it uses twice the memory required to hold the final csc matrix: half for the shared arrays, half for the arrays of all the small csc matrices... This seems OK. In approach B, each cpu only has about 1/s of the whole memory to use, so approach D is better.
The arrays of the small matrices can be kept as buffers, so no time is wasted recreating numpy arrays.
I am trying to work with B and optimize each job's memory usage per cpu core, but if approach A could avoid killing performance with a smarter lock design, I would use A... I guess I will go with D. Related question: Efficiently populate SciPy sparse matrix from subset of dictionary