As I process more and more data, the apply function I use is now to slow for my projets. I use really often vectorization in my work but for some function I tryed without success (yet).
The question is: How to vectorize this function containing multiple decision?
Please found here a not optimised code sample (using apply):
df = pd.DataFrame(np.random.randint(0,1000,size=(100000, 4)), columns=list('ABCD'))
def what_should_eat_that_cat(row):
start_ = row[0]<=500
end_ = row[1] <=500
miaw = row[2]<=200
if start_ & end_:
if miaw:
return 'cat1'
else:
return 'cat2'
if start_ & ~end_:
return 'cat3'
if ~start_ & end_:
return 'cat4'
else :
return 'cat5'
start_time = time.time()
df.loc[:,'eat_cat'] = df.loc[:,['A','B','C']].apply(what_should_eat_that_cat,axis=1)
print("--- %s seconds ---" % (time.time() - start_time))
This take 16 seconds to process for 100k lines.
The result should be somethings like:
df.eat_cat => 0 cat5 1 cat5 2 cat3 3 cat5 4 cat4
Here is my progress so far.
def what_should_eat_that_cat(A,B,C):
start_ = A <=500
end_ = B <=500
miaw = C <=200
if start_ & end_:
if miaw:
return 'cat1'
else:
return 'cat2'
if start_ & ~end_:
return 'cat3'
if ~start_ & end_:
return 'cat4'
else :
return 'cat5'
df.loc[:,'eat_cat'] = what_should_eat_that_cat(df.A, df.B, df.C)
I get this error: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all(). I understand why. But I do not get how to vectorize anyway.
Here is some documentation about vectorization: https://engineering.upside.com/a-beginners-guide-to-optimizing-pandas-code-for-speed-c09ef2c6a4d6 According to this webste, this operation may run 50x faster.
start_ & end_thinking aboutTrue & True. But you're passing Series. Instead of calling a function and passing the dataframe's columns as parameters, usedf.applyand use your function as parameter - Yucadf.applyis a suggestion that would help you overcome the error of ambiguous value. - Yuca