3
votes

I'm trying to find a vectorized/fast/numpy friendly way to convert the following values in column A, to column B:

ID  A   B
1   0   0
2   0   0
3   1   0
4   1   1
5   0   1
6   0   1
7   -1  1
8   0   0
9   1   0
10  0   1
11  0   1
12  1   1
13  0   1
14  -1  1
15  0   0

The algorithm to define column 'B' would be fill all gaps between groups of 1 and -1 with the value of 1, skipping the first row in each pair. That is, for ID4-ID7, column B is filled with ones (given the initial 1 in column A @ ID3). Next, from ID10-ID14 is filled with ones (since column A @ ID9 =1).

While this is easy to do with a for loop, I'm wondering if a non-loop solution exists? An O(n) loop based solution is below:

import numpy as np
import pandas as pd
x = np.array([ 0, 0, 1, 1, 0 ,0, -1, 0, 1, 0 , 0, 1, 0, -1, 0])


def make_y(x,showminus=False):
    y = x * 0
    state = 0 # are we in 1 or 0 or -1
    for i,n in enumerate(x):
        if n == 1 and n != state:
            state = n
            if i < len(y)-1:
                y[i+1] = state
        elif n == -1 and n != state:
            y[i] = state
            if showminus:
                state = -1
            else:
                state = 0
        else:
            y[i] = state
    return y

y = make_y(x)
print pd.DataFrame([x,y]).T

The above function yields the following performance on my machine:

%timeit y = make_y(x)
10000 loops, best of 3: 28 µs per loop

I'm guessing there must be some way to make the whole thing faster, as I will eventually need to deal with arrays that are 10million+ elements long...

2
Is the pattern always if A is 1 then next row B is 1 until -1 appears in A. That is the 1 and -1 mark the beginning and end of the contiguous 1s (but not including the row that the 1 appears in A)EdChum
@EdChum - that's correct. However, you may have noticed in the make_y loop function that there is a parameter to also keep track of the -1 regions as well. I left that part out of the question, in order to simplify things (initially).bazel
This is tricky, I can't think of way to do this without some iteration, you could get the indices of these markers using something like mask = df.loc[(df['A'].shift() == 1) | (df['A']==-1)] then collapse this again using mask.loc[(mask['A'] == -1) | (mask['A'].shift(-1) != -1)] which should then display the start and end indices and then iterate over or pull the indices into a list of tuple pairs where the pair has beg,end and set these to 1.EdChum

2 Answers

2
votes

A possible vectorized solution could be as follows

idx_1s, = np.where(x == -1)  # find the positions of the -1's
idx1s, = np.where(x == 1)  # find the positions of the 1's

To find which 1's should turn into 0's and mark the start of a block of 1's:

idx0s = np.concatenate(([0], np.searchsorted(idx1s, idx_1s[:-1])))
idx0s = idx1s[idx0s]

We now have two arrays of equal length, idx0s and idx_1s, marking the positions of the first and last item of each block, so we could now do:

y = x.copy()
y[idx0s] = 0
idx0s += 1
idx_1s += 1
mask = np.zeros_like(y, dtype=np.bool)
mask[idx0s] = True
mask[idx_1s] = True
mask = np.logical_xor.accumulate(mask)
y[mask] = 1

Which yields the desired:

>>> y
array([0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0])

It may be a little flimsy with malformed inputs, and I don't think it will handle trailing -1's gracefully. But the only non-O(n) operation is the call to searchsorted, but searchsorted has optimizations to make searched of sorted keys faster, so it will probably not be noticeable.

If I time this on your x, it doesn't beat the loop version, but for much larger arrays it probably will.

1
votes

This works fine,

A=[0,0,1,1,0,0,-1,0,1,0,0,1,0,-1,0]
B=[]
#initializing column with same number of zeros 
for j in range(len(A)):
    B.append(0)
print A
for i in range(len(A)):
    #retrieve the indices of pair (1 to -1)
    try:
            one_index=A.index(1)
            neg_one_index=A.index(-1)
    except:
            pass 
    one_index=one_index+1
    #replacing the zeros in column B by 1 at correct locations
    while one_index<=neg_one_index:
            B[one_index]=1
            A[one_index-1]=0
            A[one_index]=0
            one_index=one_index+1
print B
#output->[0,0,0,1,1,1,1,0,0,1,1,1,1,1,0] (i.e correct)