pandas apply/map is my nemesis: even on small datasets it can be agonizingly slow. Below is a very simple example where the difference in speed is nearly three orders of magnitude. I create a Series with 1 million values and simply want to map values greater than .5 to 'Yes' and all other values to 'No'. How do I vectorize this or speed it up significantly?

ser = pd.Series(np.random.rand(1000000))

%%timeit
# vectorized and fast
ser > .5

1000 loops, best of 3: 477 µs per loop

%%timeit
ser.map(lambda x: 'Yes' if x > .5 else 'No')

1 loop, best of 3: 255 ms per loop

1 Answer

np.where(cond, A, B) is the vectorized equivalent of A if cond else B:

import numpy as np
import pandas as pd
ser = pd.Series(np.random.rand(1000000))
mask = ser > 0.5
result = pd.Series(np.where(mask, 'Yes', 'No'))
expected = ser.map(lambda x: 'Yes' if x > .5 else 'No')
assert result.equals(expected)

In [77]: %timeit mask = ser > 0.5
1000 loops, best of 3: 1.44 ms per loop

In [76]: %timeit np.where(mask, 'Yes', 'No')
100 loops, best of 3: 14.8 ms per loop

In [73]: %timeit pd.Series(np.where(mask, 'Yes', 'No'))
10 loops, best of 3: 86.5 ms per loop

In [74]: %timeit ser.map(lambda x: 'Yes' if x > .5 else 'No')
1 loop, best of 3: 223 ms per loop
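
One caveat worth sketching: np.where returns a plain NumPy array, so the Series built from it gets a fresh default index. That is fine for the example above, where ser already has a default index, but if your Series carries its own labels you probably want to pass them back in. The small Series and its string labels below are hypothetical, chosen only to illustrate the point:

```python
import numpy as np
import pandas as pd

# Hypothetical Series with a non-default index (labels are for illustration only).
ser = pd.Series([0.2, 0.9, 0.5, 0.7], index=list("abcd"))
mask = ser > 0.5

# np.where returns a bare ndarray, so hand the original index back explicitly.
result = pd.Series(np.where(mask, "Yes", "No"), index=ser.index)

# The slow map-based version keeps the index on its own; use it as a reference.
expected = ser.map(lambda x: "Yes" if x > 0.5 else "No")
```

Without index=ser.index, result would be labelled 0..3 and would no longer line up with ser in later operations.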

Since the result only takes two distinct values, you might consider using a Categorical instead:

In [94]: cat = pd.Categorical.from_codes(codes=mask.astype(int), categories=['No', 'Yes']); cat
Out[94]: 
[Yes, No, Yes, No, No, ..., No, Yes, No, No, Yes]
Length: 1000000
Categories (2, object): [No, Yes]

In [95]: %timeit pd.Categorical.from_codes(codes=mask.astype(int), categories=['No', 'Yes'])
100 loops, best of 3: 6.26 ms per loop

Not only is this faster, it is also more memory efficient, since it avoids creating the array of strings. The category codes are an array of ints that index into the categories, so the categories must be listed as ['No', 'Yes'] for False (code 0) to map to 'No' and True (code 1) to map to 'Yes':

In [96]: cat.codes
Out[96]: array([1, 0, 1, ..., 0, 0, 1], dtype=int8)

In [97]: cat.categories
Out[97]: Index(['No', 'Yes'], dtype='object')
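
If you want to check the memory claim on your own machine, here is a rough sketch that compares the two representations with Series.memory_usage(deep=True). The seed and the variable names (as_category, as_strings, cat_bytes, str_bytes) are my own choices, and the exact byte counts will depend on your pandas version, so treat it as a way to measure rather than as a benchmark result:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility
ser = pd.Series(rng.random(1_000_000))
mask = ser > 0.5

# Codes index into categories: False -> 0 -> 'No', True -> 1 -> 'Yes'.
cat = pd.Categorical.from_codes(codes=mask.astype(int), categories=["No", "Yes"])
as_category = pd.Series(cat, index=ser.index)

# The plain string version, for comparison.
as_strings = pd.Series(np.where(mask, "Yes", "No"), index=ser.index)

# deep=True also counts the string data itself, not just the pointers to it.
cat_bytes = as_category.memory_usage(deep=True)
str_bytes = as_strings.memory_usage(deep=True)
print(cat_bytes, str_bytes)
```

Wrapping the Categorical in a Series gives you a category-dtype column that still compares against 'Yes' and 'No' like the string version does.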