Numpy.array indexing question

Question

I am trying to create a 'mask' of a numpy.array by specifying certain criteria. Python even has nice syntax for something like this:

>> A = numpy.array([1,2,3,4,5])
>> A > 3
array([False, False, False, True, True])

But if I have a list of criteria instead of a range:

>> A = numpy.array([1,2,3,4,5])
>> crit = [1,3,5]

I can't do this:

>> A in crit

I have to do something based on list comprehensions, like this:

>> [a in crit for a in A]
array([True, False, True, False, True])

Which is correct.

Now, the problem is that I am working with large arrays and the above code is very slow. Is there a more natural way of doing this operation that might speed it up?

EDIT: I was able to get a small speedup by making crit into a set.

EDIT2: For those who are interested:

Jouni's approach: 1000 loops, best of 3: 102 µs per loop

numpy.in1d: 1000 loops, best of 3: 1.33 ms per loop

EDIT3: Just tested again with B = randint(10,size=100)

Jouni's approach: 1000 loops, best of 3: 2.96 ms per loop

numpy.in1d: 1000 loops, best of 3: 1.34 ms per loop

Conclusion: Use numpy.in1d() unless B is very small.

Justin Peel Justin Peel · Accepted Answer · 2010-10-21T19:05:49

I think that the numpy function in1d is what you are looking for:

>>> A = numpy.array([1,2,3,4,5])
>>> B = [1,3,5]
>>> numpy.in1d(A,crit)
array([ True, False,  True, False,  True], dtype=bool)

as stated in its docstring, "in1d(a, b) is roughly equivalent to np.array([item in b for item in a])"

Admittedly, I haven't done any speed tests, but it sounds like what you are looking for.

Another faster way

Here's another way to do it which is faster. Sort the B array first(containing the elements you are looking to find in A), turn it into a numpy array, and then do:

B[B.searchsorted(A)] == A

though if you have elements in A that are larger than the largest in B, you will need to do:

inds = B.searchsorted(A)
inds[inds == len(B)] = 0
mask = B[inds] == A

It may not be faster for small arrays (especially for B being small), but before long it will definitely be faster. Why? Because this is a O(N log M) algorithm, where N is the number of elements in A and M is the number of elements in M, putting together a bunch of individual masks is O(N * M). I tested it with N = 10000 and M = 14 and it was already faster. Anyway, just thought that you might like to know, especially if you are truly planning on using this on very large arrays.

Numpy.array indexing question

3 Answers