I'm trying to find a vectorized/fast/numpy-friendly way to convert the following values in column A to column B:
ID   A   B
 1   0   0
 2   0   0
 3   1   0
 4   1   1
 5   0   1
 6   0   1
 7  -1   1
 8   0   0
 9   1   0
10   0   1
11   0   1
12   1   1
13   0   1
14  -1   1
15   0   0
The algorithm to define column 'B' is: fill every gap between a 1 and the following -1 in column A with the value 1, skipping the row that contains the opening 1 itself. That is, ID4-ID7 in column B are filled with ones (given the initial 1 in column A at ID3), and ID10-ID14 are filled with ones (since column A at ID9 is 1).
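As a rough sketch of which index ranges get filled (using the sample data above; the searchsorted step is just one illustrative way to pair each -1 with the first 1 of its region, not part of the question's code):

import numpy as np

A = np.array([0, 0, 1, 1, 0, 0, -1, 0, 1, 0, 0, 1, 0, -1, 0])
starts = np.flatnonzero(A == 1)    # candidate region openers
ends = np.flatnonzero(A == -1)     # region closers
# keep only the first 1 after the previous -1 (repeated 1s inside a region are ignored)
openers = starts[np.searchsorted(starts, np.r_[0, ends[:-1]])]
for s, e in zip(openers, ends):
    print(s + 1, e)    # 0-based: (3, 6) and (9, 13), i.e. ID4-ID7 and ID10-ID14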
While this is easy to do with a for loop, I'm wondering whether a non-loop solution exists. An O(n) loop-based solution is below:
import numpy as np
import pandas as pd

x = np.array([0, 0, 1, 1, 0, 0, -1, 0, 1, 0, 0, 1, 0, -1, 0])

def make_y(x, showminus=False):
    y = x * 0
    state = 0  # current fill state: 1, 0, or (optionally) -1
    for i, n in enumerate(x):
        if n == 1 and n != state:      # a 1 opens a region; filling starts on the next row
            state = n
            if i < len(y) - 1:
                y[i + 1] = state
        elif n == -1 and n != state:   # a -1 ends the region; this row keeps the previous state before resetting
            y[i] = state
            if showminus:
                state = -1
            else:
                state = 0
        else:                          # otherwise just carry the current state forward
            y[i] = state
    return y

y = make_y(x)
print(pd.DataFrame([x, y]).T)
The above function yields the following performance on my machine:
%timeit y = make_y(x)
10000 loops, best of 3: 28 µs per loop
I'm guessing there must be some way to make the whole thing faster, as I will eventually need to deal with arrays that are 10 million+ elements long...
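For reference, here is a minimal sketch of one possible vectorized equivalent (assuming the default showminus=False behaviour; make_y_vectorized is a hypothetical name, and forward-filling the index of the most recent nonzero entry is just one approach, not necessarily the fastest):

import numpy as np

def make_y_vectorized(x):
    x = np.asarray(x)
    n = len(x)
    # index of the most recent nonzero entry at or before each position (-1 if none yet)
    last = np.maximum.accumulate(np.where(x != 0, np.arange(n), -1))
    # state after each row: 1 while the most recent nonzero entry was a +1, else 0
    state = np.where(last >= 0, x[np.maximum(last, 0)] == 1, False).astype(x.dtype)
    # make_y starts writing the state on the *next* row, so shift everything down by one
    return np.concatenate(([0], state[:-1]))

x = np.array([0, 0, 1, 1, 0, 0, -1, 0, 1, 0, 0, 1, 0, -1, 0])
print(np.array_equal(make_y_vectorized(x), make_y(x)))   # should print True for this sample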
Comments:

In the make_y loop function there is a parameter to also keep track of the -1 regions as well. I left that part out of the question, in order to simplify things (initially). – bazel

Try mask = df.loc[(df['A'].shift() == 1) | (df['A'] == -1)], then collapse this again using mask.loc[(mask['A'] == -1) | (mask['A'].shift(-1) != -1)], which should then display the start and end indices; you can then iterate over them, or pull the indices into a list of (beg, end) tuple pairs, and set those ranges to 1. – EdChum
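A rough illustration of EdChum's suggestion on the sample data above (the two mask expressions are his; the pairing and filling loop at the end is an added, hypothetical finishing step):

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [0, 0, 1, 1, 0, 0, -1, 0, 1, 0, 0, 1, 0, -1, 0]})

# rows that either follow a 1 or are themselves a -1
mask = df.loc[(df['A'].shift() == 1) | (df['A'] == -1)]
# drop intermediate rows so (for this sample) only the start and end index of each region remain
collapsed = mask.loc[(mask['A'] == -1) | (mask['A'].shift(-1) != -1)]

idx = collapsed.index.values               # [3, 6, 9, 13] for the sample data
pairs = list(zip(idx[::2], idx[1::2]))     # (beg, end) pairs: [(3, 6), (9, 13)]

B = np.zeros(len(df), dtype=int)
for beg, end in pairs:
    B[beg:end + 1] = 1                     # fill each region, inclusive of the -1 row
df['B'] = B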