I want to build sparse matrices incrementally, and according to the SciPy documentation there are two suitable options for this:
class scipy.sparse.lil_matrix(arg1, shape=None, dtype=None, copy=False)
    Row-based linked list sparse matrix.
    This is an efficient structure for constructing sparse matrices incrementally.

class scipy.sparse.dok_matrix(arg1, shape=None, dtype=None, copy=False)
    Dictionary Of Keys based sparse matrix.
    This is an efficient structure for constructing sparse matrices incrementally.
But when I benchmark them against building a dictionary of dictionaries of values (which can later be converted to a sparse matrix with little effort; see the sketch at the end), the dict of dicts turns out to be about 10-20 times faster than either of the sparse matrix formats:
from scipy.sparse import dok_matrix, lil_matrix
from timeit import timeit
from collections import defaultdict

def common_dict(rows, cols):
    freqs = defaultdict(lambda: defaultdict(int))
    for row, col in zip(rows, cols):
        freqs[row][col] += 1
    return freqs

def dok(rows, cols):
    freqs = dok_matrix((1000, 1000))
    for row, col in zip(rows, cols):
        freqs[row, col] += 1
    return freqs

def lil(rows, cols):
    freqs = lil_matrix((1000, 1000))
    for row, col in zip(rows, cols):
        freqs[row, col] += 1
    return freqs
def benchmark():
    cols = range(1000)
    rows = range(1000)

    res = timeit("common_dict({},{})".format(rows, cols),
                 "from __main__ import common_dict",
                 number=100)
    print("common_dict: {}".format(res))

    res = timeit("dok({},{})".format(rows, cols),
                 "from __main__ import dok",
                 number=100)
    print("dok: {}".format(res))

    res = timeit("lil({},{})".format(rows, cols),
                 "from __main__ import lil",
                 number=100)
    print("lil: {}".format(res))
Results:
benchmark()
common_dict: 0.11778324202168733
dok: 2.2927695910912007
lil: 1.3541790939634666
What causes such overhead for the matrix formats, and is there some way to speed it up? Are there use cases where either dok or lil is preferable to a plain dict of dicts?
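For completeness, here is a minimal sketch of the conversion step mentioned above, assuming the dict-of-dicts layout produced by common_dict (the helper name to_sparse is mine, not from SciPy):

from scipy.sparse import coo_matrix

def to_sparse(freqs, shape=(1000, 1000)):
    # Flatten the dict of dicts into parallel row/col/data lists
    # and build a COO matrix, which converts cheaply to CSR/CSC.
    rows, cols, data = [], [], []
    for row, inner in freqs.items():
        for col, val in inner.items():
            rows.append(row)
            cols.append(col)
            data.append(val)
    return coo_matrix((data, (rows, cols)), shape=shape)

mat = to_sparse(common_dict(range(1000), range(1000))).tocsr()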