12 votes

I want to iteratively build sparse matrices, and noticed that there are two suitable options for this according to the SciPy documentation:

LiL matrix:

class scipy.sparse.lil_matrix(arg1, shape=None, dtype=None, copy=False)
Row-based linked list sparse matrix

This is an efficient structure for constructing sparse matrices incrementally.

DoK matrix:

class scipy.sparse.dok_matrix(arg1, shape=None, dtype=None, copy=False)
Dictionary Of Keys based sparse matrix.

This is an efficient structure for constructing sparse matrices incrementally.

But when I benchmark these against building a dictionary of dictionaries of values (which can later be converted to a sparse matrix easily), the latter turns out to be about 10-20 times faster than either of the sparse matrix formats:

from scipy.sparse import dok_matrix, lil_matrix
from timeit import timeit
from collections import defaultdict

def common_dict(rows, cols):
    freqs = defaultdict(lambda: defaultdict(int))
    for row, col in zip(rows, cols):
        freqs[row][col] += 1

    return freqs

def dok(rows, cols):
    freqs = dok_matrix((1000,1000))
    for row, col in zip(rows, cols):
        freqs[row,col] += 1

    return freqs

def lil(rows, cols):
    freqs = lil_matrix((1000,1000))
    for row, col in zip(rows, cols):
        freqs[row,col] += 1

    return freqs


def benchmark():
    cols = range(1000)
    rows = range(1000)

    res = timeit("common_dict({},{})".format(rows, cols), 
                 "from __main__ import common_dict", 
                 number=100)

    print("common_dict: {}".format(res))

    res = timeit("dok({},{})".format(rows, cols), 
                 "from __main__ import dok", 
                 number=100)

    print("dok: {}".format(res))

    res = timeit("lil({},{})".format(rows, cols), 
                 "from __main__ import lil", 
                 number=100)

    print("lil: {}".format(res))

Results:

benchmark()

common_dict: 0.11778324202168733
dok: 2.2927695910912007
lil: 1.3541790939634666

What causes such overhead for the matrix formats, and is there some way to speed it up? Are there use cases where either dok or lil is preferable to a common dict of dicts?


2 Answers

14 votes

When I change your += to just = for your two sparse arrays:

for row, col in zip(rows, cols):
    #freqs[row,col] += 1
    freqs[row,col] = 1

their respective times are cut in half. What consumes the most time is the indexing: with += it has to do both a __getitem__ and a __setitem__.
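A quick way to check the indexing overhead yourself (a minimal sketch; absolute numbers will vary with machine and SciPy version):

from timeit import timeit

setup = "from scipy.sparse import dok_matrix; m = dok_matrix((10, 10))"

# plain assignment: one __setitem__ call per statement
print(timeit("m[5, 5] = 1", setup, number=10000))

# in-place add: a __getitem__ followed by a __setitem__ per statement
print(timeit("m[5, 5] += 1", setup, number=10000))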

When the docs say that dok and lil are better for iterative construction they mean that it's easier to expand their underlying data structures than for the other formats.

When I try to build a csr matrix with your code, I get:

/usr/lib/python2.7/dist-packages/scipy/sparse/compressed.py:690: SparseEfficiencyWarning: Changing the sparsity structure of a csr_matrix is expensive. lil_matrix is more efficient.
  SparseEfficiencyWarning)

and the code runs about 30x slower.
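If you want to see the warning yourself, a minimal sketch (not from the original post) is:

from scipy.sparse import csr_matrix

freqs = csr_matrix((1000, 1000))
freqs[0, 0] += 1   # triggers SparseEfficiencyWarning: changing the sparsity structure is expensive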

So the speed claims are relative to formats like csr, not relative to pure Python or numpy structures.

You might want to look at the Python code for dok_matrix.__getitem__ and dok_matrix.__setitem__ to see what happens when you do freqs[row, col].
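One way to pull those up from an interpreter (assuming your SciPy version implements them in pure Python, which current releases do):

import inspect
from scipy.sparse import dok_matrix

# print the Python code that runs on every freqs[row, col] access
print(inspect.getsource(dok_matrix.__getitem__))
print(inspect.getsource(dok_matrix.__setitem__))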


A faster way to construct your dok would be:

freqs = dok_matrix((1000,1000))
d = dict()
for row, col in zip(rows, cols):
    d[(row, col)] = 1
freqs.update(d)   # one bulk dict update instead of many indexed writes

taking advantage of the fact that a dok is a subclassed dictionary. Note that a dok matrix is not a dictionary of dictionaries; its keys are tuples like (50, 50).
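If you need the counting behavior of += rather than just setting ones, one sketch is to tally the duplicates with collections.Counter first and then do a single bulk update (dok_from_counts is a made-up name; note that some newer SciPy releases disable dok_matrix.update, in which case you can call dict.update directly):

from collections import Counter
from scipy.sparse import dok_matrix

def dok_from_counts(rows, cols):
    counts = Counter(zip(rows, cols))   # keys are (row, col) tuples
    freqs = dok_matrix((1000, 1000))
    # one bulk update instead of N indexed writes; if dok_matrix.update
    # is disabled in your SciPy version, use dict.update(freqs, counts)
    freqs.update(counts)
    return freqs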

Another fast way of constructing the same sparse array is:

# assumes: import numpy as np; from scipy import sparse
freqs = sparse.coo_matrix((np.ones(1000, int), (rows, cols)))

In other words, since you already have the rows and cols arrays (or ranges), calculate the corresponding data array, and THEN construct the sparse array.
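A nice property of the coo route is that duplicate (row, col) pairs are summed when the matrix is converted, which matches the counting semantics of the += loop. A small sketch to verify that (toy data, not the question's):

import numpy as np
from scipy import sparse

rows = [0, 0, 1]   # the (0, 0) entry appears twice
cols = [0, 0, 2]
m = sparse.coo_matrix((np.ones(3, int), (rows, cols)), shape=(3, 3))
print(m.tocsr().toarray())   # (0, 0) holds 2: duplicates were summed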

But if you must perform sparse operations on your matrix between incremental growth steps, then dok or lil may be your best choices.


Sparse matrices were developed for linear algebra problems, such as solving a linear equation with a large sparse matrix. I used them years ago in MATLAB to solve finite difference problems. For that work the calculation-friendly csr format is the ultimate goal, and the coo format is a convenient initialization format.

Many of the SO scipy.sparse questions now arise from scikit-learn and text analysis problems. They are also used in biological database files. But even there, the (data, (row, col)) definition method works best.

So sparse matrices were never intended for fast incremental creation. The traditional Python structures like dictionaries and lists are much better for that.


Here's a faster dok iteration that takes advantage of its dictionary methods. update seems to work as fast as on a plain dictionary. get is about 3x faster than the equivalent indexing (freqs[row, col]). Indexing probably uses get, but must have a lot of overhead.

def fast_dok(rows, cols):
    freqs = dok_matrix((1000, 1000))
    for row, col in zip(rows, cols):
        # use the inherited dict methods instead of sparse-matrix indexing
        i = freqs.get((row, col), 0)
        freqs.update({(row, col): i + 1})
    return freqs

Skipping the get, and just doing

        freqs.update({(row, col): 1})

is even faster (though it only sets each entry to 1 rather than counting occurrences). It is faster than the defaultdict-of-defaultdicts example, and nearly as fast as simple dictionary initialization, {(r, c): 1 for r, c in zip(rows, cols)}.

1 vote

There are various reasons why your test is not fair. Firstly, you're including the overhead of constructing the sparse matrices as part of your timed loop.

Secondly, and arguably more importantly, you should use data structures as they are designed to be used, with operations on the whole array at once. That is, rather than iterating over the rows and columns and adding 1 each time, compute the data for the whole array and apply it in a single vectorized step, as sketched below.
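A minimal sketch of that vectorized approach, reusing the question's shape and inputs (the single-call construction is the point; duplicate (row, col) pairs are summed on conversion, matching the += loop):

import numpy as np
from scipy.sparse import coo_matrix

rows = np.arange(1000)
cols = np.arange(1000)

# build the data for the whole array once, then construct in a single call
freqs = coo_matrix((np.ones(len(rows), int), (rows, cols)),
                   shape=(1000, 1000)).tocsr()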