Broadcast 1D array against 2D array for lexsort : Permutation for sorting each column independently when considering yet another vector

Question

Consider the array a

np.random.seed([3,1415])
a = np.random.randint(10, size=(5, 4))
a

array([[0, 2, 7, 3],
       [8, 7, 0, 6],
       [8, 6, 0, 2],
       [0, 4, 9, 7],
       [3, 2, 4, 3]])

I can create b which contains the permutation to sort each column.

b = a.argsort(0)
b

array([[0, 0, 1, 2],
       [3, 4, 2, 0],
       [4, 3, 4, 4],
       [1, 2, 0, 1],
       [2, 1, 3, 3]])

I can sort a with b

a[b, np.arange(a.shape[1])[None, :]]

array([[0, 2, 0, 2],
       [0, 2, 0, 3],
       [3, 4, 4, 3],
       [8, 6, 7, 6],
       [8, 7, 9, 7]])

That was the primer to illustrate the output I'm looking for. I want an array b that has the required permutation for sorting the corresponding column in a when also considering a lexsort with another array.

np.random.seed([3,1415])
a = np.random.randint(10, size=(10, 4))
g = np.random.choice(list('abc'), 10)

a

array([[0, 2, 7, 3],
       [8, 7, 0, 6],
       [8, 6, 0, 2],
       [0, 4, 9, 7],
       [3, 2, 4, 3],
       [3, 6, 7, 7],
       [4, 5, 3, 7],
       [5, 9, 8, 7],
       [6, 4, 7, 6],
       [2, 6, 6, 5]])

g

array(['c', 'a', 'c', 'b', 'a', 'a', 'a', 'b', 'c', 'b'], 
      dtype='<U1')

I want to produce an array b where each column is the requisite permutation to lexsort the corresponding column a. And the lexsort is from sorting the column first by the groups defined by g and then by the values in each column in a.

I can generate the results with:

r = np.column_stack([np.lexsort([a[:, i], g]) for i in range(a.shape[1])])
r

array([[4, 4, 1, 4],
       [5, 6, 6, 1],
       [6, 5, 4, 5],
       [1, 1, 5, 6],
       [3, 3, 9, 9],
       [9, 9, 7, 3],
       [7, 7, 3, 7],
       [0, 0, 2, 2],
       [8, 8, 0, 0],
       [2, 2, 8, 8]])

We can see that this works

g[r]

array([['a', 'a', 'a', 'a'],
       ['a', 'a', 'a', 'a'],
       ['a', 'a', 'a', 'a'],
       ['a', 'a', 'a', 'a'],
       ['b', 'b', 'b', 'b'],
       ['b', 'b', 'b', 'b'],
       ['b', 'b', 'b', 'b'],
       ['c', 'c', 'c', 'c'],
       ['c', 'c', 'c', 'c'],
       ['c', 'c', 'c', 'c']], 
      dtype='<U1')

and

a[r, np.arange(a.shape[1])[None, :]]

array([[3, 2, 0, 3],
       [3, 5, 3, 6],
       [4, 6, 4, 7],
       [8, 7, 7, 7],
       [0, 4, 6, 5],
       [2, 6, 8, 7],
       [5, 9, 9, 7],
       [0, 2, 0, 2],
       [6, 4, 7, 3],
       [8, 6, 7, 6]])

Question

Is there a way to "broadcast" the use of the grouping array g for use in every columns lexsort? What is a more efficient way to do this?

Divakar Divakar · Accepted Answer · 2017-05-25T22:45:02

Here's one approach -

def app1(a, g):
    m,n = a.shape

    g_idx = np.unique(g, return_inverse=1)[1]
    N = g_idx.max()+1

    g_idx2D = g_idx[:,None] + N*np.arange(n)
    r_out = np.lexsort([a.ravel('F'), g_idx2D.ravel('F')]).reshape(-1,m).T
    r_out -= m*np.arange(n)
    return r_out

The idea is simply that we create a 2D grid of integer version of g array of strings and then offset each column by a barrier that would limit the lexsort search within each column.

Now, on performance, it seems for large datasets, lexsort itself would be the bottleneck. For our problem, we are dealing with just two columns. So, we can create our own custom lexsort that scales the second column based on an offset, which is the max limit of number from the first column. The implementation for the same would look something like this -

def lexsort_twocols(A, B):
    S = A.max() - A.min() + 1
    return (B*S + A).argsort()

Thus, incorporating this into our proposed method and optimizing the creation of g_idx2D, we would have a formal function like so -

def proposed_app(a, g):
    m,n = a.shape

    g_idx = np.unique(g, return_inverse=1)[1]
    N = g_idx.max()+1

    g_idx2D = (g_idx + N*np.arange(n)[:,None]).ravel()
    r_out = lexsort_twocols(a.ravel('F'), g_idx2D).reshape(-1,m).T    
    r_out -= m*np.arange(n)
    return r_out

Runtime test

Original approach :

def org_app(a, g):
    return np.column_stack([np.lexsort([a[:, i], g]) for i in range(a.shape[1])])

Timings -

In [763]: a = np.random.randint(10, size=(20, 10000))
     ...: g = np.random.choice(list('abcdefgh'), 20)
     ...: 

In [764]: %timeit org_app(a,g)
10 loops, best of 3: 27.7 ms per loop

In [765]: %timeit app1(a,g)
10 loops, best of 3: 25.4 ms per loop

In [766]: %timeit proposed_app(a,g)
100 loops, best of 3: 5.93 ms per loop

Broadcast 1D array against 2D array for lexsort : Permutation for sorting each column independently when considering yet another vector

2 Answers