I am using Pandas dataframes and want to create a new column as a function of existing columns. I have not seen a good discussion of the speed difference between df.apply() and np.vectorize(), so I thought I would ask here.
The Pandas apply() function is slow. From what I measured (shown below in some experiments), using np.vectorize() is 25x faster (or more) than using the DataFrame function apply(), at least on my 2016 MacBook Pro. Is this an expected result, and why?
For example, suppose I have the following dataframe with N rows:
N = 10
A_list = np.random.randint(1, 100, N)
B_list = np.random.randint(1, 100, N)
df = pd.DataFrame({'A': A_list, 'B': B_list})
df.head()
# A B
# 0 78 50
# 1 23 91
# 2 55 62
# 3 82 64
# 4 99 80
Suppose further that I want to create a new column as a function of the two columns A and B. In the example below, I'll use a simple function divide(). To apply the function, I can use either df.apply() or np.vectorize():
def divide(a, b):
    if b == 0:
        return 0.0
    return float(a) / b
df['result'] = df.apply(lambda row: divide(row['A'], row['B']), axis=1)
df['result2'] = np.vectorize(divide)(df['A'], df['B'])
df.head()
# A B result result2
# 0 78 50 1.560000 1.560000
# 1 23 91 0.252747 0.252747
# 2 55 62 0.887097 0.887097
# 3 82 64 1.281250 1.281250
# 4 99 80 1.237500 1.237500
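For what it's worth, the same column can also be computed with no per-element Python call at all, using plain NumPy arithmetic on whole columns. This is just a sketch (the column name result3 is mine, and to_numpy() assumes pandas 0.24+):

a = df['A'].to_numpy(dtype=float)
b = df['B'].to_numpy(dtype=float)
# np.divide writes a/b only where b != 0; elsewhere the 0.0 initialized
# in `out` is kept, matching divide()'s behavior for b == 0.
df['result3'] = np.divide(a, b, out=np.zeros_like(a), where=(b != 0))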
If I increase N to real-world sizes like 1 million or more, then I observe that np.vectorize() is 25x faster or more than df.apply().
Below is some complete benchmarking code:
import pandas as pd
import numpy as np
import time

def divide(a, b):
    if b == 0:
        return 0.0
    return float(a) / b

for N in [1000, 10000, 100000, 1000000, 10000000]:
    print()

    A_list = np.random.randint(1, 100, N)
    B_list = np.random.randint(1, 100, N)
    df = pd.DataFrame({'A': A_list, 'B': B_list})

    start_epoch_sec = int(time.time())
    df['result'] = df.apply(lambda row: divide(row['A'], row['B']), axis=1)
    end_epoch_sec = int(time.time())
    result_apply = end_epoch_sec - start_epoch_sec

    start_epoch_sec = int(time.time())
    df['result2'] = np.vectorize(divide)(df['A'], df['B'])
    end_epoch_sec = int(time.time())
    result_vectorize = end_epoch_sec - start_epoch_sec

    print('N=%d, df.apply: %d sec, np.vectorize: %d sec' %
          (N, result_apply, result_vectorize))

    # Make sure results from df.apply and np.vectorize match.
    assert df['result'].equals(df['result2'])
The results are shown below:
N=1000, df.apply: 0 sec, np.vectorize: 0 sec
N=10000, df.apply: 1 sec, np.vectorize: 0 sec
N=100000, df.apply: 2 sec, np.vectorize: 0 sec
N=1000000, df.apply: 24 sec, np.vectorize: 1 sec
N=10000000, df.apply: 262 sec, np.vectorize: 4 sec
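One caveat on the methodology: int(time.time()) has only one-second resolution, so the smaller numbers above are coarse. A sketch of a finer-grained measurement using the standard library's time.perf_counter(), if you want to reproduce this:

import time

start = time.perf_counter()
df['result2'] = np.vectorize(divide)(df['A'], df['B'])
elapsed = time.perf_counter() - start
# perf_counter() gives sub-millisecond resolution, unlike int(time.time()).
print('np.vectorize: %.3f sec' % elapsed)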
If np.vectorize() is in general faster than df.apply(), then why is np.vectorize() not mentioned more? I only ever see StackOverflow posts related to df.apply(), such as:
pandas create new column based on values from other columns
Comments:
np.vectorize is basically a python for loop (it's a convenience method) and apply with a lambda is also in python time – roganjosh
Don't use apply on a row-by-row basis unless you have to, and obviously a vectorized function will out-perform a non-vectorized one. – PMende
np.vectorize is not vectorized. It's a well-known misnomer – roganjosh
The same goes for .str accessors. They're slower than list comprehensions in a lot of cases. We assume too much. – roganjosh
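To make the first comment concrete: np.vectorize() still calls divide() once per element from the Python interpreter, so it is roughly a list comprehension with broadcasting and dtype handling added, not true NumPy vectorization. A minimal sketch of the equivalence (the variable names are illustrative):

# Both lines call divide() N times in Python; neither runs at C speed.
via_vectorize = np.vectorize(divide)(df['A'], df['B'])
via_listcomp = [divide(a, b) for a, b in zip(df['A'], df['B'])]
assert list(via_vectorize) == via_listcomp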