vectorization with pandas series - multiple 'complex' boolean categorisations - run time optimisation

Question

As I process more and more data, the apply function I use is now too slow for my projects. I use vectorization very often in my work, but for some functions I have tried without success (so far).

The question is: how do I vectorize this function, which contains multiple decisions?

Here is an unoptimised code sample (using apply):

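import time

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0,1000,size=(100000, 4)), columns=list('ABCD'))

def what_should_eat_that_cat(row):
    # row holds the A, B and C values of a single line of the dataframe
    start_ = row['A']<=500
    end_ = row['B'] <=500
    miaw = row['C']<=200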

    if start_ & end_:
        if miaw:
            return 'cat1'
        else:
            return 'cat2'        
    if start_ & ~end_:
        return 'cat3'   
    if ~start_ & end_:
        return 'cat4'
    else :
        return 'cat5'

start_time = time.time()

df.loc[:,'eat_cat'] = df.loc[:,['A','B','C']].apply(what_should_eat_that_cat,axis=1)

print("--- %s seconds ---" % (time.time() - start_time))

This takes 16 seconds to process 100k rows.

The result should look something like:

df.eat_cat =>
0    cat5
1    cat5
2    cat3
3    cat5
4    cat4

Here is my progress so far.

def what_should_eat_that_cat(A,B,C):
    start_ = A <=500
    end_ = B <=500  
    miaw = C <=200

    if start_ & end_:
        if miaw:
            return 'cat1'
        else:
            return 'cat2'        
    if start_ & ~end_:
        return 'cat3'   
    if ~start_ & end_:
        return 'cat4'
    else :
        return 'cat5'

df.loc[:,'eat_cat'] = what_should_eat_that_cat(df.A, df.B, df.C)

I get this error: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all(). I understand why it happens (the if statements receive a whole Series instead of a single boolean), but I still do not see how to vectorize the function.
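
To illustrate the error above, here is a minimal sketch. The three-row frame and its values are assumptions made only for this illustration. Combining two comparisons with & gives a boolean Series with one value per row, and an if statement cannot reduce that to a single True or False:

```python
import pandas as pd

# Hypothetical three-row frame, used only for illustration.
demo = pd.DataFrame({'A': [100, 600, 400], 'B': [300, 200, 900]})

start_ = demo.A <= 500   # boolean Series, one value per row
end_ = demo.B <= 500

mask = start_ & end_     # element-wise AND, still a Series
print(mask.tolist())

try:
    if mask:             # asks for ONE truth value for the whole Series
        pass
except ValueError as err:
    message = str(err)
    print(message)
```

The mask itself is fine; the problem is only that if needs one boolean, while the mask holds one boolean per row.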

Here is some documentation about vectorization: https://engineering.upside.com/a-beginners-guide-to-optimizing-pandas-code-for-speed-c09ef2c6a4d6 According to this website, a vectorized version of this kind of operation may run 50x faster.

You're writing start_ & end_ as if it were True & True, but you're passing Series. Instead of calling the function with the dataframe's columns as parameters, use df.apply and pass your function to it. — Yuca
@Yuca: I agree with what you say, but my impression is that apply will work while bringing the computing time back up. I would like to fully vectorize this function to save computing time. — Ludo Schmidt
The reason I'm writing a comment is to shed light on your error. If I had a solution I would write it as an answer :) The df.apply suggestion is only meant to get you past the ambiguous truth value error. — Yuca

Ludo Schmidt · Accepted Answer · 2018-10-26T13:25:03

I found a way to make it 52x faster:

import time

import numpy as np
import pandas as pd

def categ(dataframe):
    # each comparison returns a boolean Series (one value per row)
    start_ = dataframe.A <=500
    end_ = dataframe.B <=500
    miaw = dataframe.C <=200

    # we treat each case separately in a vectorized way
    # (note: this also adds a 'cat' column to the dataframe passed in)
    dataframe.loc[start_ & end_ & miaw, 'cat'] = 'cat1'
    dataframe.loc[start_ & end_ & ~miaw, 'cat'] = 'cat2'
    dataframe.loc[start_ & ~end_, 'cat'] = 'cat3'
    dataframe.loc[~start_ & end_, 'cat'] = 'cat4'
    dataframe.loc[~start_ & ~end_, 'cat'] = 'cat5'

    return dataframe['cat']


df = pd.DataFrame(np.random.randint(0,1000,size=(100000, 4)), columns=list('ABCD'))

start_time = time.time()
df.loc[:,'eat_cat'] = categ(df)
print("--- %s seconds ---" % (time.time() - start_time))

This takes 0.3 seconds instead of 16 seconds (with apply). I hope this helps others who struggle with this as I did.
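
The same masks can also be passed to numpy.select, which takes a list of conditions and a matching list of labels and returns one label per row. What follows is only a sketch of that alternative and is not the code that was timed above. The name categ_select and the small check frame are assumptions for illustration, and this variant is not benchmarked here.

```python
import numpy as np
import pandas as pd

def categ_select(dataframe):
    """Return one category label per row without modifying the input frame."""
    start_ = dataframe['A'] <= 500
    end_ = dataframe['B'] <= 500
    miaw = dataframe['C'] <= 200

    # One boolean mask per category, in the same order as the labels below.
    conditions = [
        start_ & end_ & miaw,
        start_ & end_ & ~miaw,
        start_ & ~end_,
        ~start_ & end_,
    ]
    choices = ['cat1', 'cat2', 'cat3', 'cat4']

    # Rows matching none of the conditions get the default label.
    labels = np.select(conditions, choices, default='cat5')
    return pd.Series(labels, index=dataframe.index, name='eat_cat')

# Hypothetical check frame: one row per category, plus the boundary values.
check = pd.DataFrame({
    'A': [100, 100, 100, 900, 900, 500],
    'B': [100, 100, 900, 100, 900, 500],
    'C': [100, 900, 0, 0, 0, 200],
})
result = categ_select(check)
print(result.tolist())
```

Unlike the version with .loc assignments, this sketch leaves the input dataframe unchanged and returns a new Series, so the caller decides which column name to store it under.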
