
I have a table in MySQL with three columns: row index, column index, and value, which I want to read into a scipy csr_matrix. I use the Python MySQL connector. There are 112,500 non-zero elements.

Try 1:

from scipy.sparse import csr_matrix

A = csr_matrix((N_rows, N_cols), dtype=float)
show = 'SELECT * FROM my_table'
cursor.execute(show)
for (row, col, value) in cursor:
    A[row, col] = value  # item assignment on CSR changes the sparsity structure each time

This is too slow; I had to stop it after 60 seconds. It raised an efficiency warning and suggested using lil matrices instead.

Try 2:

from scipy.sparse import lil_matrix, csr_matrix

A = lil_matrix((N_rows, N_cols), dtype=float)
show = 'SELECT * FROM my_table'
cursor.execute(show)
for (row, col, value) in cursor:
    A[row, col] = value
A = csr_matrix(A)

This takes 6.4 seconds (average of three runs). Is this as good as it gets, or is there a quicker way to create the csr_matrix without looping? If I call cursor.fetchall(), the data looks like:

[(row_0, col_0, value_0), (row_1, col_1, value_1), ...]

This cannot be passed directly to the csr_matrix constructor.


1 Answer


The data returned by cursor.fetchall() is almost in the coo_matrix format. You can do

import numpy as np
from scipy.sparse import coo_matrix

data = cursor.fetchall()
# e.g. data = [(1, 2, 1.2), (3, 4, 7.1)]

# structured array: one named field per column of the result set
arr = np.array(data, dtype=[('row', int), ('col', int), ('value', float)])
spmat = coo_matrix((arr['value'], (arr['row'], arr['col'])))
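
Since the dimensions are known in the question, it is probably worth passing them explicitly, because coo_matrix otherwise infers the shape from the largest row and column indices present. A minimal follow-up sketch, reusing N_rows and N_cols from the question:

spmat = coo_matrix((arr['value'], (arr['row'], arr['col'])),
                   shape=(N_rows, N_cols))
A = spmat.tocsr()  # any duplicate (row, col) pairs are summed on conversion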

Instead of np.array(cursor.fetchall(), ...) it is preferable to use

arr = np.fromiter(cursor, dtype=[('row', int), ('col', int), ('value', float)])

to stream the data from the DB directly into the NumPy array, without building the intermediate list of tuples.
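
Putting the pieces together, a complete sketch is below; it assumes the cursor and the N_rows/N_cols dimensions from the question:

import numpy as np
from scipy.sparse import coo_matrix

cursor.execute('SELECT * FROM my_table')

# stream rows straight into a structured array, skipping fetchall()'s list
arr = np.fromiter(cursor, dtype=[('row', int), ('col', int), ('value', float)])

# build in COO form once, then convert to CSR in a single step
A = coo_matrix((arr['value'], (arr['row'], arr['col'])),
               shape=(N_rows, N_cols)).tocsr()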