7
votes

I am trying to create a 'mask' of a numpy.array by specifying certain criteria. Python even has nice syntax for something like this:

>> A = numpy.array([1,2,3,4,5])
>> A > 3
array([False, False, False, True, True])

But if I have a list of criteria instead of a range:

>> A = numpy.array([1,2,3,4,5])
>> crit = [1,3,5]

I can't do this:

>> A in crit

I have to do something based on list comprehensions, like this:

>> [a in crit for a in A]
array([True, False, True, False, True])

Which is correct.

Now, the problem is that I am working with large arrays and the above code is very slow. Is there a more natural way of doing this operation that might speed it up?

EDIT: I was able to get a small speedup by making crit into a set.

EDIT2: For those who are interested:

Jouni's approach: 1000 loops, best of 3: 102 µs per loop

numpy.in1d: 1000 loops, best of 3: 1.33 ms per loop

EDIT3: Just tested again with B = randint(10,size=100)

Jouni's approach: 1000 loops, best of 3: 2.96 ms per loop

numpy.in1d: 1000 loops, best of 3: 1.34 ms per loop

Conclusion: Use numpy.in1d() unless B is very small.

3

3 Answers

6
votes

I think that the numpy function in1d is what you are looking for:

>>> A = numpy.array([1,2,3,4,5])
>>> B = [1,3,5]
>>> numpy.in1d(A,crit)
array([ True, False,  True, False,  True], dtype=bool)

as stated in its docstring, "in1d(a, b) is roughly equivalent to np.array([item in b for item in a])"

Admittedly, I haven't done any speed tests, but it sounds like what you are looking for.

Another faster way

Here's another way to do it which is faster. Sort the B array first(containing the elements you are looking to find in A), turn it into a numpy array, and then do:

B[B.searchsorted(A)] == A

though if you have elements in A that are larger than the largest in B, you will need to do:

inds = B.searchsorted(A)
inds[inds == len(B)] = 0
mask = B[inds] == A

It may not be faster for small arrays (especially for B being small), but before long it will definitely be faster. Why? Because this is a O(N log M) algorithm, where N is the number of elements in A and M is the number of elements in M, putting together a bunch of individual masks is O(N * M). I tested it with N = 10000 and M = 14 and it was already faster. Anyway, just thought that you might like to know, especially if you are truly planning on using this on very large arrays.

3
votes

Combine several comparisons with "or":

A = randint(10,size=10000)
mask = (A == 1) | (A == 3) | (A == 5)

Or if you have a list B and want to create the mask dynamically:

B = [1, 3, 5]
mask = zeros((10000,),dtype=bool)
for t in B: mask = mask | (A == t)
0
votes

Create a mask and use the compress function of the numpy array. It should be much faster. If you have a complex criteria, remember to construct it based on math of the arrays.

a = numpy.array([3,1,2,4,5])
mask = a > 3
b = a.compress(mask)

or

a = numpy.random.random_integers(1,5,100000)
c=a.compress((a<=4)*(a>=2)) ## numbers between n<=4 and n>=2
d=a.compress(~((a<=4)*(a>=2))) ## numbers either n>4 or n<2

Ok, if you want a mask that has all a in [1,3,5] you can do something like

a = numpy.random.random_integers(1,5,100000)
mask=(a==1)+(a==3)+(a==5)

or

a = numpy.random.random_integers(1,5,100000)
mask = numpy.zeros(len(a), dtype=bool)
for num in [1,3,5]:
    mask += (a==num)