So let's create your matrix (too bad you didn't give inputs that I could copy-n-paste)
In [114]: data=[4,1,2,2,1,1,4,3,2]
In [115]: col=[0,1,1,2,2,3,4,4,4]
In [116]: row=[2,0,4,0,3,5,0,2,3]
In [117]: M=sparse.csr_matrix((data,(col,row)))
In [118]: M
Out[118]:
<5x6 sparse matrix of type '<type 'numpy.int32'>'
with 9 stored elements in Compressed Sparse Row format>
In [119]: M.A
Out[119]:
array([[0, 0, 4, 0, 0, 0],
[1, 0, 0, 0, 2, 0],
[2, 0, 0, 1, 0, 0],
[0, 0, 0, 0, 0, 1],
[4, 0, 3, 2, 0, 0]])
In [121]: center=np.array([[0,1,2,2,4,1],[3,4,1,2,4,0]])
So how did you calculate the distance? M.A
is (5,6), center
is (2,6). It's not obvious what you are doing with these two arrays.
As for access to the 'raw' sparse values, the coo
format is easiest to understand. It's the same row,col,data stuff I used to create the matrix
In [131]: M.tocoo().data
Out[131]: array([4, 1, 2, 2, 1, 1, 4, 3, 2])
In [132]: M.tocoo().col
Out[132]: array([2, 0, 4, 0, 3, 5, 0, 2, 3])
In [133]: M.tocoo().row
Out[133]: array([0, 1, 1, 2, 2, 3, 4, 4, 4])
The csr
stores the same info in data
, indices
and indptr
arrays. But you have to do some math to pull to the i,j
values from the last 2. csr
multiplication routines make good use of these arrays.
In general it is better to do multiplication with csr
matrices than addition/subtraction.
I await further clarification.
spatial.distance.cdist(center,M.A, 'euclidean')
Out[156]:
array([[ 5.09901951, 3.87298335, 5.19615242, 5. , 5.91607978],
[ 7.34846923, 5.38516481, 5.91607978, 6.8556546 , 6.08276253]])
What we need to do is study this function, and understand its inputs. We may have to go beyond its docs and look at the code.
But looking at this code I see steps to ensure that xB
is 2d array, with the same number of columns as xA
. Then for euclidian
it calls
_distance_wrap.cdist_euclidean_wrap(_convert_to_double(XA),
_convert_to_double(XB), dm)
which looks like a wrapper on some C code. I can't imagine any way of feeding it a sparse matrix.
You could iterate over rows; calling dist
with M[[0],:].A
is the same as M.A[[0],:]
- except for speed. Iterating over rows of a sparse matrix is kind of slow, as much because it has to construct a new sparse matrix at each iteration. csr
and lil
are the 2 fastest for row iteration.
Here's something that might be faster - directly iterating on the attributes of the lil
format:
def foo(a,b,n):
# make a dense array from data,row
res = np.zeros((1,n))
res[0,b]=a
return res
In [190]: Ml=M.tolil()
In [191]: Ml.data
Out[191]: array([[4], [1, 2], [2, 1], [1], [4, 3, 2]], dtype=object)
In [192]: Ml.rows
Out[192]: array([[2], [0, 4], [0, 3], [5], [0, 2, 3]], dtype=object)
In [193]: rowgen=(foo(a,b,6) for a,b in zip(Ml.data,Ml.rows))
In [194]: np.concatenate([spatial.distance.cdist(center,row, 'euclidean') for row in rowgen],axis=1)
Out[194]:
array([[ 5.09901951, 3.87298335, 5.19615242, 5. , 5.91607978],
[ 7.34846923, 5.38516481, 5.91607978, 6.8556546 , 6.08276253]])
For now I'll skip the time tests.
scipy.spatial.distance.cdist(center, data_csr, 'euclidean')
– Rochana Nana