82 votes

Is there any function in numpy to group the array below by its first column?

I couldn't find a good answer on the internet.

>>> a
array([[  1, 275],
       [  1, 441],
       [  1, 494],
       [  1, 593],
       [  2, 679],
       [  2, 533],
       [  2, 686],
       [  3, 559],
       [  3, 219],
       [  3, 455],
       [  4, 605],
       [  4, 468],
       [  4, 692],
       [  4, 613]])

Wanted output:

array([[[275, 441, 494, 593]],
       [[679, 533, 686]],
       [[559, 219, 455]],
       [[605, 468, 692, 613]]], dtype=object)

9 Answers

41 votes

Inspired by Eelco Hoogendoorn's library, but without needing his library, and using the fact that the first column of your array is always increasing (if not, sort first with a = a[a[:, 0].argsort()]):

>>> np.split(a[:, 1], np.unique(a[:, 0], return_index=True)[1][1:])
[array([275, 441, 494, 593]),
 array([679, 533, 686]),
 array([559, 219, 455]),
 array([605, 468, 692, 613])]

I didn't time it, but this is probably the fastest way to solve the question:

  • No native Python loop
  • The resulting lists are numpy arrays, so if you need to run other numpy operations on them, no new conversion is needed
  • Complexity is O(n)

[EDIT] I improved the answer thanks to ns63sr's answer and Behzad Shayegh's comment.
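
For reference, a minimal timing sketch of this approach (the array a is the one from the question; the repeat count is arbitrary):

import timeit
import numpy as np

a = np.array([[1, 275], [1, 441], [1, 494], [1, 593],
              [2, 679], [2, 533], [2, 686],
              [3, 559], [3, 219], [3, 455],
              [4, 605], [4, 468], [4, 692], [4, 613]])

# time the split-based grouping
t = timeit.timeit(
    lambda: np.split(a[:, 1], np.unique(a[:, 0], return_index=True)[1][1:]),
    number=10000,
)
print(t / 10000 * 1e6, "µs per call")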

32 votes

The numpy_indexed package (disclaimer: I am its author) aims to fill this gap in numpy. All operations in numpy-indexed are fully vectorized, and no O(n^2) algorithms were harmed during the making of this library.

import numpy_indexed as npi
npi.group_by(a[:, 0]).split(a[:, 1])

Note that it is usually more efficient to directly compute relevant properties over such groups (i.e., group_by(keys).mean(values)) rather than first splitting into a list / jagged array.
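
For instance, a small sketch of that reduction style (assuming the array a from the question, and that the reduction returns the unique keys alongside the per-group values):

import numpy_indexed as npi

# per-group means of the second column, keyed by the first
keys, means = npi.group_by(a[:, 0]).mean(a[:, 1])
print(keys)   # [1 2 3 4]
print(means)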

23 votes

Numpy is not very handy here because the desired output is not an array of integers (it is an array of list objects).

I suggest either the pure Python way...

from collections import defaultdict

%%timeit
d = defaultdict(list)
for key, val in a:
    d[key].append(val)
10.7 µs ± 156 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

# result:
defaultdict(list,
        {1: [275, 441, 494, 593],
         2: [679, 533, 686],
         3: [559, 219, 455],
         4: [605, 468, 692, 613]})
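
If numpy arrays per group are needed afterwards (to match the wanted output), one extra pass converts the dict values; a small follow-up sketch, assuming d from the snippet above (dict values keep first-seen key order on Python 3.7+):

import numpy as np

grouped = [np.array(v) for v in d.values()]
# [array([275, 441, 494, 593]), array([679, 533, 686]),
#  array([559, 219, 455]), array([605, 468, 692, 613])]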

...or the pandas way:

import pandas as pd

%%timeit
df = pd.DataFrame(a, columns=["key", "val"])
df.groupby("key").val.apply(pd.Series.tolist)
979 µs ± 3.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

# result:
key
1    [275, 441, 494, 593]
2         [679, 533, 686]
3         [559, 219, 455]
4    [605, 468, 692, 613]
Name: val, dtype: object
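
To get plain numpy arrays back out of the groupby, a hedged sketch (assuming df from the snippet above):

groups = [g.to_numpy() for _, g in df.groupby("key").val]
# [array([275, 441, 494, 593]), array([679, 533, 686]),
#  array([559, 219, 455]), array([605, 468, 692, 613])]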
13 votes

# dtype=object is required for the ragged groups on modern numpy
n = np.unique(a[:, 0])
np.array([list(a[a[:, 0] == i, 1]) for i in n], dtype=object)

outputs:

array([[275, 441, 494, 593], [679, 533, 686], [559, 219, 455],
       [605, 468, 692, 613]], dtype=object)
8 votes

Simplifying the answer of Vincent J and considering the comment of HS-nebula, one can use return_index=True instead of return_counts=True and get rid of the cumsum:

np.split(a[:, 1], np.unique(a[:, 0], return_index=True)[1])[1:]

Output

[array([275, 441, 494, 593]),
 array([679, 533, 686]),
 array([559, 219, 455]),
 array([605, 468, 692, 613])]
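
The trailing [1:] is needed because the first index returned by np.unique is 0, and splitting there produces an empty leading array; a quick check (assuming a from the question):

pieces = np.split(a[:, 1], np.unique(a[:, 0], return_index=True)[1])
print(pieces[0])  # an empty array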
1 vote

I used np.unique() followed by np.extract():

unique = np.unique(a[:, 0])
answer = []
for element in unique:
    present = a[:, 0] == element
    answer.append(np.extract(present, a[:, -1]))
print(answer)

[array([275, 441, 494, 593]), array([679, 533, 686]), array([559, 219, 455]), array([605, 468, 692, 613])]

1 vote

Given X as the array of items you want grouped and y (a 1-D array) as the corresponding groups, the following function does the grouping with numpy:

import numpy as np

def groupby(X, y):
    X = np.asarray(X)
    y = np.asarray(y)
    # one boolean-mask selection per unique group key
    y_uniques = np.unique(y)
    return [X[y == yi] for yi in y_uniques]

So, groupby(a[:, 1], a[:, 0]) returns:

[array([275, 441, 494, 593]),
 array([679, 533, 686]),
 array([559, 219, 455]),
 array([605, 468, 692, 613])]

0 votes

We might also find it useful to generate a dict:

import numpy as np

def groupby(X):
    X = np.asarray(X)
    x_uniques = np.unique(X)
    return {xi: X[X == xi] for xi in x_uniques}

Let's try it out:

X = [1, 1, 2, 2, 3, 3, 3, 3, 4, 5, 6, 7, 7, 8, 9, 9, 1, 1, 1]
groupby(X)
Out[9]: 
{1: array([1, 1, 1, 1, 1]),
 2: array([2, 2]),
 3: array([3, 3, 3, 3]),
 4: array([4]),
 5: array([5]),
 6: array([6]),
 7: array([7, 7]),
 8: array([8]),
 9: array([9, 9])}

Note this by itself is not super compelling, but if we make X an object or namedtuple and then provide a key-based groupby function, it becomes more interesting; a rough sketch of that idea follows.
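
A hedged sketch of that idea (the Point type and the key function are illustrative assumptions, not part of the original answer):

from collections import namedtuple
import numpy as np

Point = namedtuple("Point", ["label", "value"])

def groupby_key(items, key):
    # build an array of group keys, one per item
    keys = np.array([key(item) for item in items])
    # collect the items belonging to each unique key
    return {k: [item for item, ki in zip(items, keys) if ki == k]
            for k in np.unique(keys)}

pts = [Point("a", 1), Point("b", 2), Point("a", 3)]
groupby_key(pts, key=lambda p: p.label)
# {'a': [Point(label='a', value=1), Point(label='a', value=3)],
#  'b': [Point(label='b', value=2)]}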

0 votes

Late to the party, but anyway: if you plan not only to group the arrays but also to do operations on them like sum, mean, and so on, and you're doing this with speed in mind, you might also want to consider numpy_groupies. All of those group operations are optimized and jitted with numba, so they easily outperform the other solutions mentioned here.

from numpy_groupies.aggregate_numpy import aggregate

aggregate(a[:, 0], a[:, 1], "array", fill_value=[])
# array([array([], dtype=int64), array([275, 441, 494, 593]),
#        array([679, 533, 686]), array([559, 219, 455]),
#        array([605, 468, 692, 613])], dtype=object)
# (slot 0 is empty because the group keys start at 1)

aggregate(a[:, 0], a[:, 1], "sum")
# array([   0, 1803, 1898, 1233, 2378])