0 votes

So I started yesterday on applying a function to a decent-sized dataset (6 million rows), but it's taking forever. I'm even trying to use pandarallel, but that isn't working well either. In any case, here is the code that I'm using...

import numpy as np

def classifyForecast(dataframe):

    # Number of rows (demand buckets) with non-zero demand
    buckets = len(dataframe[dataframe['QUANTITY'] != 0])

    try:
        # ADI: average demand interval; COV: coefficient of variation
        adi = dataframe.shape[0] / buckets
        cov = dataframe['QUANTITY'].std() / dataframe['QUANTITY'].mean()

        if adi < 1.32:
            if cov < .49:
                dataframe['TYPE'] = 'Smooth'
            else:
                dataframe['TYPE'] = 'Erratic'
        else:
            if cov < .49:
                dataframe['TYPE'] = 'Intermittent'
            else:
                dataframe['TYPE'] = 'Lumpy'

    except:
        dataframe['TYPE'] = 'Smooth'

    try:
        dataframe['ADI'] = adi
    except:
        dataframe['ADI'] = np.inf
    try:
        dataframe['COV'] = cov
    except:
        dataframe['COV'] = np.inf

    return dataframe

from pandarallel import pandarallel

pandarallel.initialize()

def quick_classification(df):
    return df.parallel_apply(classifyForecast(df))

Also, please note that I am splitting the dataframe up into batches. I don't want the function to work on each row; instead, I want it to work on the chunks, so that I can get the .mean() and .std() of specific columns. Roughly, the batching looks like the sketch below.
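A minimal sketch of what I mean by chunks (I group on my cp_ref column; the variable names here are just for illustration):

import pandas as pd

# Split the dataframe into one chunk per cp_ref value, so each chunk
# gets its own .mean() and .std() inside classifyForecast, then
# stitch the classified chunks back together.
chunks = [group for _, group in df.groupby('cp_ref')]
df = pd.concat([classifyForecast(chunk) for chunk in chunks])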

It shouldn't take 48 hours to complete. How do I speed this up?

If that's literally your code, that's not how you use apply. You are supposed to pass a function to parallel_apply, and that function will be called once for each row. You are not PASSING the function, you are CALLING your function. It will do its work in the normal way, then return a dataframe. You then pass that dataframe to parallel_apply. Who knows what that will do. – Tim Roberts

But your code won't work a row at a time. Your adi and cov values need to apply to the entire dataframe, right? But if that's the case, what is the rest of the code doing? Maybe you should describe the problem in words. – Tim Roberts

How would you even know if it's working or not with those bare excepts? – Axe319

I know that it is working because I monitor the progress using print and a percentage of completion. I'm cutting the dataframe into parts by looking at the cp_ref column and using it to pull out unique data that is more than just one row at a time. I need it to apply the function to the dataframe in these chunks. – Ravaal

1 Answer

0 votes

It looks like mean and std are the only heavy calculations here, so I'm guessing that they are the bottleneck.

You could try speeding it up with numba.

from numba import njit
import numpy as np

@njit(parallel=True)
def numba_mean(x):
    # np.mean is a NumPy reduction that numba can run in parallel
    return np.mean(x)

@njit(parallel=True)
def numba_std(x):
    return np.std(x)

cov = numba_std(dataframe['QUANTITY'].values) / numba_mean(dataframe['QUANTITY'].values)
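Note that the first call to each function pays a one-time compilation cost; after that the compiled code is reused. As a rough sketch of the per-chunk call pattern (reusing the cp_ref grouping from your question; the float64 cast is an assumption about your data):

covs = {}
for key, chunk in df.groupby('cp_ref'):
    # Contiguous float64 array keeps numba on its fast path
    qty = chunk['QUANTITY'].to_numpy(dtype=np.float64)
    covs[key] = numba_std(qty) / numba_mean(qty)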