4
votes

This is a typical use case for FEM/FVM equation systems, so is perhaps of broader interest. From a triangular mesh à la

enter image description here

I would like to create a scipy.sparse.csr_matrix. The matrix rows/columns represent values at the nodes of the mesh. The matrix has entries on the main diagonal and wherever two nodes are connected by an edge.

Here's an MWE that first builds a node->edge->cells relationship and then builds the matrix:

import numpy
import meshzoo
from scipy import sparse

nx = 1600
ny = 1000
verts, cells = meshzoo.rectangle(0.0, 1.61, 0.0, 1.0, nx, ny)

n = len(verts)

nds = cells.T
nodes_edge_cells = numpy.stack([nds[[1, 2]], nds[[2, 0]],nds[[0, 1]]], axis=1)

# assign values to each edge (per cell)
alpha = numpy.random.rand(3, len(cells))
vals = numpy.array([
    [alpha**2, -alpha],
    [-alpha, alpha**2],
    ])

# Build I, J, V entries for COO matrix
I = []
J = []
V = []
#
V.append(vals[0][0])
V.append(vals[0][1])
V.append(vals[1][0])
V.append(vals[1][1])
#
I.append(nodes_edge_cells[0])
I.append(nodes_edge_cells[0])
I.append(nodes_edge_cells[1])
I.append(nodes_edge_cells[1])
#
J.append(nodes_edge_cells[0])
J.append(nodes_edge_cells[1])
J.append(nodes_edge_cells[0])
J.append(nodes_edge_cells[1])
# Create suitable data for coo_matrix
I = numpy.concatenate(I).flat
J = numpy.concatenate(J).flat
V = numpy.concatenate(V).flat

matrix = sparse.coo_matrix((V, (I, J)), shape=(n, n))
matrix = matrix.tocsr()

With

python -m cProfile -o profile.prof main.py
snakeviz profile.prof

one can create and view a profile of the above:

enter image description here

The method tocsr() takes the lion share of the runtime here, but this is also true when building alpha is more complex. Consequently, I'm looking for ways to speed this up.

What I've already found:

  • Due to the structure of the data, the values on the diagonal of the matrix can be summed up in advance, i.e.,

    V.append(vals[0, 0, 0] + vals[1, 1, 2])
    I.append(nodes_edge_cells[0, 0])  # == nodes_edge_cells[1, 2]
    J.append(nodes_edge_cells[0, 0])  # == nodes_edge_cells[1, 2]
    

    This makes I, J, V shorter and thus speeds up tocsr.

  • Right now, edges are "per cell". I could identify equal edges with each other using numpy.unique, effectively saving about half of I, J, V. However, I found that this too takes some time. (Not surprising.)

One other thought that I had was that that I could replace the diagonal V, I, J by a simple numpy.add.at if there was a csr_matrix-like data structure where the main diagonal is kept separately. I know that this exists in some other software packages, but couldn't find it in scipy. Correct?

Perhaps there's a sensible way to construct CSR directly?

2
I'd like to see some timings for the different steps. The tocsr uses compiled code. I would think that any massaging of the coo inputs before hand would take just as long if not more. - hpaulj
Are you sure tocsr is taking a long time? I've done something very similar for a 10k by 10k matrix where I, J, V had lengths well into the millions, and it didn't take all that long. Maybe 5-10 seconds. - BallpointBen
Thanks for the comments. I've added a profile to the original post. - Nico Schlömer
Interesting. I thought the sum_duplicates was part of the compiled csr construction, but your profile shows it's a Python method in coo.py. It uses lexsort, np.nonzero and reduceat. So doing your own sum_duplicates has promise. If your coo has_canonical_format the tocsr should be much faster. - hpaulj
csr has its own sum_duplicates method. That uses a compiled sparse._sparsetools.csr_sum_duplicates. - hpaulj

2 Answers

3
votes

I would try creating the csr structure directly, especially if you are resorting to np.unique since this gives you sorted keys, which is half the job done.

I'm assuming you are at the point where you have i, j sorted lexicographically and overlapping v summed using np.add.at on the optional inverse output of np.unique.

Then v and j are already in csr format. All that's left to do is creating the indptr which you simply get by np.searchsorted(i, np.arange(M+1)) where M is the column length. You can pass these directly to the sparse.csr_matrix constructor.

Ok, let code speak:

import numpy as np
from scipy import sparse
from timeit import timeit


def tocsr(I, J, E, N):
    n = len(I)
    K = np.empty((n,), dtype=np.int64)
    K.view(np.int32).reshape(n, 2).T[...] = J, I  
    S = np.argsort(K)
    KS = K[S]
    steps = np.flatnonzero(np.r_[1, np.diff(KS)])
    ED = np.add.reduceat(E[S], steps)
    JD, ID = KS[steps].view(np.int32).reshape(-1, 2).T
    ID = np.searchsorted(ID, np.arange(N+1))
    return sparse.csr_matrix((ED, np.array(JD, dtype=int), ID), (N, N))

def viacoo(I, J, E, N):
    return sparse.coo_matrix((E, (I, J)), (N, N)).tocsr()


#testing and timing

# correctness
N = 1000
A = np.random.random((N, N)) < 0.001
I, J = np.where(A)

E = np.random.random((2, len(I)))
D = np.zeros((2,) + A.shape)
D[:, I, J] = E
D2 = tocsr(np.r_[I, I], np.r_[J, J], E.ravel(), N).A

print('correct:', np.allclose(D.sum(axis=0), D2))

# speed
N = 100000
K = 10

I, J = np.random.randint(0, N, (2, K*N))
E = np.random.random((2 * len(I),))
I, J, E = np.r_[I, I, J, J], np.r_[J, J, I, I], np.r_[E, E]

print('N:', N, ' --  nnz (with duplicates):', len(E))
print('direct: ', timeit('f(a,b,c,d)', number=10, globals={'f': tocsr, 'a': I, 'b': J, 'c': E, 'd': N}), 'secs for 10 iterations')
print('via coo:', timeit('f(a,b,c,d)', number=10, globals={'f': viacoo, 'a': I, 'b': J, 'c': E, 'd': N}), 'secs for 10 iterations')

Prints:

correct: True
N: 100000  --  nnz (with duplicates): 4000000
direct:  7.702431229001377 secs for 10 iterations
via coo: 41.813509466010146 secs for 10 iterations

Speedup: 5x

0
votes

So, in the end this turned out to be the difference between COO's and CSR's sum_duplicates (just like @hpaulj suspected). Thanks to the efforts of everyone involved here (particularly @paul-panzer), a PR is underway to give tocsr a tremendous speedup.

SciPy's tocsr does a lexsort on (I, J), so it helps organizing the indices in such a way that (I, J) will come out fairly sorted already.

For for nx=4, ny=2 in the above example, I and J are

[1 6 3 5 2 7 5 5 7 4 5 6 0 2 2 0 1 2 1 6 3 5 2 7 5 5 7 4 5 6 0 2 2 0 1 2 5 5 7 4 5 6 0 2 2 0 1 2 1 6 3 5 2 7 5 5 7 4 5 6 0 2 2 0 1 2 1 6 3 5 2 7]
[1 6 3 5 2 7 5 5 7 4 5 6 0 2 2 0 1 2 5 5 7 4 5 6 0 2 2 0 1 2 1 6 3 5 2 7 1 6 3 5 2 7 5 5 7 4 5 6 0 2 2 0 1 2 5 5 7 4 5 6 0 2 2 0 1 2 1 6 3 5 2 7]

First sorting each row of cells, then the rows by the first column like

cells = numpy.sort(cells, axis=1)
cells = cells[cells[:, 0].argsort()]

produces

[1 4 2 5 3 6 5 5 5 6 7 7 0 0 1 2 2 2 1 4 2 5 3 6 5 5 5 6 7 7 0 0 1 2 2 2 5 5 5 6 7 7 0 0 1 2 2 2 1 4 2 5 3 6 5 5 5 6 7 7 0 0 1 2 2 2 1 4 2 5 3 6]
[1 4 2 5 3 6 5 5 5 6 7 7 0 0 1 2 2 2 5 5 5 6 7 7 0 0 1 2 2 2 1 4 2 5 3 6 1 4 2 5 3 6 5 5 5 6 7 7 0 0 1 2 2 2 5 5 5 6 7 7 0 0 1 2 2 2 1 4 2 5 3 6]

For the number in the original post, sorting cuts down the runtime from about 40 seconds to 8 seconds.

Perhaps an even better ordering can be achieved if the nodes are numbered more appropriately in the first place. I'm thinking of Cuthill-McKee and friends.