I have a scipy sparse matrix with 10e6 rows and 10e3 columns, about 1% of which is populated. I also have an array of 10e6 keys, one per row of the sparse matrix. I want to group the rows of the sparse matrix by these keys and aggregate each group with a sum.
Example:
Keys:
['foo','bar','foo','baz','baz','bar']
Sparse matrix:
(0,1) 3 -> corresponds to the first 'foo' key
(1,10) 4 -> corresponds to the first 'bar' key
(2,1) 1 -> corresponds to the second 'foo' key
(3,3) 2 -> corresponds to the first 'baz' key
(4,3) 10 -> corresponds to the second 'baz' key
(5,4) 1 -> corresponds to the second 'bar' key
Expected result:
{
'foo': {1: 4}, -> 4 = 3 + 1
'bar': {4: 1, 10: 4},
'baz': {3: 12} -> 12 = 2 + 10
}
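For reference, here is a minimal sketch that reproduces the toy example above with scipy and a plain dictionary loop (nothing beyond numpy/scipy is assumed; the loop is only meant to show the expected result, not to be fast at 10e6 rows):

    import numpy as np
    from scipy.sparse import coo_matrix

    # Toy example from above: one key per row of the matrix
    keys = ['foo', 'bar', 'foo', 'baz', 'baz', 'bar']
    rows = np.array([0, 1, 2, 3, 4, 5])
    cols = np.array([1, 10, 1, 3, 3, 4])
    vals = np.array([3, 4, 1, 2, 10, 1])
    m = coo_matrix((vals, (rows, cols)), shape=(6, 11))

    # Naive aggregation over the nonzero entries
    result = {}
    for r, c, v in zip(m.row, m.col, m.data):
        cols_for_key = result.setdefault(keys[r], {})
        cols_for_key[int(c)] = cols_for_key.get(int(c), 0) + int(v)

    print(result)
    # {'foo': {1: 4}, 'bar': {10: 4, 4: 1}, 'baz': {3: 12}}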
What is the most efficient way to do this?
I already tried to use pandas.SparseSeries.from_coo
on my sparse matrix in order to be able to use a pandas groupby, but I hit this known bug:
site-packages/pandas/tools/merge.py in __init__(self, objs, axis, join, join_axes, keys, levels, names, ignore_index, verify_integrity, copy)
863 for obj in objs:
864 if not isinstance(obj, NDFrame):
--> 865 raise TypeError("cannot concatenate a non-NDFrame object")
866
867 # consolidate
TypeError: cannot concatenate a non-NDFrame object
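As an aside, and only as a sketch rather than anything from the original attempt: the same group-by-sum can be expressed without pandas by factorizing the keys into group indices, building a sparse indicator matrix, and multiplying. The helper name group_sum below is made up for illustration.

    import numpy as np
    from scipy.sparse import csr_matrix

    def group_sum(m, keys):
        # Sum the rows of the sparse matrix `m` that share the same key.
        # Returns (unique_keys, grouped) where grouped[i] is the summed
        # row for unique_keys[i]; the result stays sparse.
        unique_keys, group_idx = np.unique(keys, return_inverse=True)
        n_groups, n_rows = len(unique_keys), m.shape[0]
        # Indicator matrix G: G[g, r] = 1 when row r belongs to group g,
        # so G @ m sums the rows within each group.
        G = csr_matrix((np.ones(n_rows, dtype=m.dtype),
                        (group_idx, np.arange(n_rows))),
                       shape=(n_groups, n_rows))
        return unique_keys, G @ m.tocsr()

    # Converting the grouped sparse result into the dict-of-dicts shape
    # shown in the expected output:
    unique_keys, grouped = group_sum(m, keys)
    grouped = grouped.tocoo()
    result = {str(k): {} for k in unique_keys}
    for g, c, v in zip(grouped.row, grouped.col, grouped.data):
        result[str(unique_keys[g])][int(c)] = int(v)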
Comment: […] keys list for each nonzero element in the matrix? As presented, the problem looks more like a list and dictionary (dok format sparse?) problem than an array/matrix one. – hpaulj