Overwrite instead of add for duplicate triplets when creating sparse matrix in scipy

Question

In scipy, when a sparse matrix is created from triplet-format data (row, col and data arrays), the default behavior is to sum the data values of all duplicates. Can I change this behavior to overwrite (or do nothing) instead?

For example:

import scipy.sparse as sparse

rows = [0, 0]
cols = [0, 0]
data = [1, 1]
S = sparse.coo_matrix((data, (rows, cols)))

Here, S.todense() is equal to matrix([[2]]), but I would like it to be matrix([[1]]).

The documentation of sparse.coo_matrix reads:

By default when converting to CSR or CSC format, duplicate (i,j) entries will be summed together. This facilitates efficient construction of finite element matrices and the like.

That wording suggests there might be options other than the default.

hpaulj · Accepted Answer · 2016-11-21T19:04:02

I've seen discussion on the scipy GitHub about giving more control over this summing, but I don't know of any production changes. As the docs indicate, there's a long-standing tradition of summing the duplicates.

As created, the coo matrix does not sum; it just assigns your parameters to its attributes:

In [697]: S = sparse.coo_matrix((data, (rows, cols)))
In [698]: S.data
Out[698]: array([1, 1])
In [699]: S.row
Out[699]: array([0, 0], dtype=int32)
In [700]: S.col
Out[700]: array([0, 0], dtype=int32)

Converting to dense (or to csr/csc) does sum - but doesn't change S itself:

In [701]: S.A
Out[701]: array([[2]])
In [702]: S.data
Out[702]: array([1, 1])

You can perform the summing in place with:

In [703]: S.sum_duplicates()
In [704]: S.data
Out[704]: array([2], dtype=int32)

I don't know of a way of either removing the duplicates or bypassing that action. I may look up the relevant issue.
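One possible workaround, outside of anything scipy does itself, is to filter the triplets before they reach coo_matrix, so that each (row, col) pair appears only once. The following is a minimal sketch under that assumption; the helper name coo_without_duplicates and its keep parameter are hypothetical, not part of scipy, and the shape has to be passed explicitly.

```python
import numpy as np
import scipy.sparse as sparse


def coo_without_duplicates(data, rows, cols, shape, keep="last"):
    """Hypothetical helper: build a COO matrix with one value per (row, col)."""
    data = np.asarray(data)
    rows = np.asarray(rows)
    cols = np.asarray(cols)
    # Map each (row, col) pair to a single integer so np.unique can compare pairs.
    flat = np.ravel_multi_index((rows, cols), shape)
    if keep == "last":
        # np.unique reports the first occurrence of each value, so search the
        # reversed array and translate the positions back to the original order.
        _, idx = np.unique(flat[::-1], return_index=True)
        idx = flat.size - 1 - idx
    else:
        # Any other value of keep retains the first occurrence.
        _, idx = np.unique(flat, return_index=True)
    return sparse.coo_matrix((data[idx], (rows[idx], cols[idx])), shape=shape)


S = coo_without_duplicates([1, 1], [0, 0], [0, 0], shape=(1, 1))
print(S.toarray())  # the single stored entry is 1, not 2
```

Because the duplicates are gone before construction, later conversions to csr/csc or dense have nothing left to sum.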

=================

S.todok() does an in-place sum (that is, it changes S). Looking at that code, I see that it calls self.sum_duplicates. The following replicates that conversion without the sum:

In [727]: dok=sparse.dok_matrix((S.shape),dtype=S.dtype)
In [728]: dok.update(zip(zip(S.row,S.col),S.data))
In [729]: dok
Out[729]: 
<1x1 sparse matrix of type '<class 'numpy.int32'>'
    with 1 stored elements in Dictionary Of Keys format>
In [730]: print(dok)
  (0, 0)    1
In [731]: S
Out[731]: 
<1x1 sparse matrix of type '<class 'numpy.int32'>'
    with 2 stored elements in COOrdinate format>
In [732]: dok.A
Out[732]: array([[1]])

It's a dictionary update, so the final value is the last of the duplicates. I found elsewhere that dok.update is a pretty fast way of adding values to a sparse matrix.

tocsr inherently does the sum; tolil uses tocsr; so this todok approach may be simplest.
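Whether dok.update can be called directly may depend on the scipy version, so it is worth checking on your install. As a sketch that does not rely on it, the same last-value-wins result can be had with plain item assignment on a dok_matrix. This is likely slower than a bulk update; the sample triplets here are made up for illustration.

```python
import scipy.sparse as sparse

# Illustrative data: (0, 0) appears twice, with values 1 and then 5.
rows = [0, 0, 1]
cols = [0, 0, 1]
data = [1, 5, 7]
S = sparse.coo_matrix((data, (rows, cols)), shape=(2, 2))

dok = sparse.dok_matrix(S.shape, dtype=S.dtype)
for r, c, v in zip(S.row, S.col, S.data):
    # A later duplicate overwrites the earlier value instead of adding to it.
    dok[int(r), int(c)] = v

print(dok.toarray())  # (0, 0) holds 5, the last duplicate
print(S.toarray())    # the usual conversion sums instead: (0, 0) holds 6
```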
